Abstract

In the noisy tensor completion problem we observe $m$ entries (whose location is chosen uniformly at
random) from an unknown $n_1 \times n_2 \times n_3$ tensor $T$. We assume that $T$ is entry-wise close to being rank
$r$. Our goal is to fill in its missing entries using as few observations as possible. Let $n = \max(n_1, n_2, n_3)$.
We show that if $m = n^{3/2} r$ then there is a polynomial time algorithm based on the sixth level of the
sum-of-squares hierarchy for completing it. Our estimate agrees with almost all of $T$'s entries almost
exactly and works even when our observations are corrupted by noise. This is also the first algorithm for
tensor completion that works in the overcomplete case when $r > n$, and in fact it works all the way up
to $r = n^{3/2-\epsilon}$.

Our proofs are short and simple and are based on establishing a new connection between noisy tensor 
completion (through the language of Rademacher complexity) and the task of refuting random constraint
satisfaction problems. This connection seems to have gone unnoticed even in the context of matrix
completion. Furthermore, we use this connection to show matching lower bounds. Our main technical 
result is in characterizing the Rademacher complexity of the sequence of norms that arise in the sum-of- 
squares relaxations to the tensor nuclear norm. These results point to an interesting new direction: Can 
we explore computational vs. sample complexity tradeoffs through the sum-of-squares hierarchy? 
1 Introduction


Matrix completion is one of the cornerstone problems in machine learning and has a diverse range of applications.
One of the original motivations for it comes from the Netflix Problem where the goal is to predict
user-movie ratings based on all the ratings we have observed so far, from across many different users. We 
can organize this data into a large, partially observed matrix where each row represents a user and each 
column represents a movie. The goal is to fill in the missing entries. The usual assumptions are that the 
ratings depend on only a few hidden characteristics of each user and movie and that the underlying matrix is 
approximately low rank. Another standard assumption is that it is incoherent, which we elaborate on later. 
How many entries of M do we need to observe in order to fill in its missing entries? And are there efficient 
algorithms for this task? 

There have been thousands of papers on this topic and by now we have a relatively complete set of 
answers. A representative result (building on earlier works by Fazel [28], Recht, Fazel and Parrilo [67], 
Srebro and Shraibman [71], Candes and Recht [19], Candes and Tao [20]) due to Keshavan, Montanari and 
Oh [50] can be phrased as follows: Suppose $M$ is an unknown $n_1 \times n_2$ matrix that has rank $r$ but each of
its entries has been corrupted by independent Gaussian noise with standard deviation $\delta$. Then if we observe
roughly

$$m = (n_1 + n_2)\, r \log(n_1 + n_2)$$

of its entries, the locations of which are chosen uniformly at random, there is an algorithm that outputs a
matrix $X$ that with high probability satisfies

$$\mathrm{err}(X) = \frac{1}{n_1 n_2} \sum_{i,j} |X_{i,j} - M_{i,j}| \le O(\delta).$$


There are extensions to non-uniform sampling models [55, 24], as well as various efficiency improvements 
[47, 40]. What is particularly remarkable about these guarantees is that the number of observations needed 
is within a logarithmic factor of the number of parameters, $(n_1 + n_2) r$, that define the model.

In fact, there are benefits to working with even higher-order structure but so far there has been little 
progress on natural extensions to the tensor setting. To motivate this problem, consider the Groupon 
Problem (which we introduce here to illustrate this point) where the goal is to predict user-activity ratings. 
The challenge is that which activities we should recommend (and how much a user liked a given activity) 
depends on time as well: weekday/weekend, day/night, summer/fall/winter/spring, etc., or even some
combination of these. As above, we can cast this problem as a large, partially observed tensor where the
first index represents a user, the second index represents an activity and the third index represents the time 
period. It is again natural to model it as being close to low rank, under the assumption that a much smaller 
number of (latent) factors about the interests of the user, the type of activity and the time period should 
contribute to the rating. How many entries of the tensor do we need to observe in order to fill in its missing 
entries? This problem is emblematic of a larger issue: Can we always solve linear inverse problems when 
the number of observations is comparable to the number of parameters in the model, or is computational
intractability an obstacle? 

In fact, one of the advantages of working with tensors is that their decompositions are unique in important 
ways that matrix decompositions are not. There has been a groundswell of recent work that uses tensor 
decompositions for exactly this reason for parameter learning in phylogenetic trees [60], HMMs [60], mixture
models [46], topic models [2] and to solve community detection [3]. In these applications, one assumes access
to the entire tensor (up to some sampling noise). But given that the underlying tensors are low-rank, can 
we observe fewer of their entries and still utilize tensor methods? 

A wide range of approaches to solving tensor completion have been proposed [56, 35, 70, 73, 61, 52, 48, 
14, 74]. However, in terms of provable guarantees none$^1$ of them improves upon the following naive algorithm.
If the unknown tensor $T$ is $n_1 \times n_2 \times n_3$ we can treat it as a collection of $n_1$ matrices each of size $n_2 \times n_3$. It
is easy to see that if $T$ has rank at most $r$ then each of these slices also has rank at most $r$ (and they inherit
incoherence properties as well). By treating a third-order tensor as nothing more than an unrelated collection
of $n_1$ low-rank matrices, we can complete each slice separately using roughly $m = n_1 (n_2 + n_3)\, r \log(n_2 + n_3)$
observations in total. When the rank is constant, this is a quadratic number of observations even though the
number of parameters in the model is linear.

$^1$ Most of the existing approaches rely on computing the tensor nuclear norm, which is hard to compute [39, 41]. The only
other algorithms we are aware of [48, 14] require that the factors be orthogonal. This is a rather strong assumption. First,
orthogonality requires the rank to be at most $n$. Second, even when $r < n$, most tensors need to be "whitened" to be put in this
form and then a random sample from the "whitened" tensor would correspond to a (dense) linear combination of the entries of
the original tensor, which would be quite a different sampling model.

Here we show how to solve the (noisy) tensor completion problem with many fewer observations. Let 
$n_1 \le n_2 \le n_3$. We give an algorithm based on the sixth level of the sum-of-squares hierarchy that can
accurately fill in the missing entries of an unknown, incoherent $n_1 \times n_2 \times n_3$ tensor $T$ that is entry-wise close
to being rank $r$ with roughly

$$m = (n_1)^{1/2} (n_2 + n_3)\, r \log^4(n_1 + n_2 + n_3)$$

observations. Moreover, our algorithm works even when the observations are corrupted by noise. When
$n = n_1 = n_2 = n_3$, this amounts to about $n^{1/2} r$ observations per slice which is much smaller than what
we would need to apply matrix completion on each slice separately. Our algorithm needs to leverage the 
structure between the various slices. 
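To make the comparison concrete, the following minimal sketch evaluates the two sample-size expressions quoted above for a cubic tensor. It assumes all hidden constants equal 1 (the text only says "roughly"), and the function names are illustrative rather than taken from the paper. Under that assumption the $\log^4$ factor dominates for moderate $n$, so the polynomial saving only becomes visible for very large $n$.

```python
import math


def naive_samples(n1, n2, n3, r):
    """Slice-by-slice matrix completion: n1 (n2 + n3) r log(n2 + n3)."""
    return n1 * (n2 + n3) * r * math.log(n2 + n3)


def sos_samples(n1, n2, n3, r):
    """Bound stated above: n1^(1/2) (n2 + n3) r log^4(n1 + n2 + n3)."""
    return math.sqrt(n1) * (n2 + n3) * r * math.log(n1 + n2 + n3) ** 4


def ratio(n, r=1):
    """sos_samples / naive_samples for a cubic n x n x n tensor.

    Constants are assumed to be 1, so only the trend in n is meaningful.
    """
    return sos_samples(n, n, n, r) / naive_samples(n, n, n, r)


for n in (10**3, 10**8, 10**12):
    print(f"n = {n:.0e}: ratio = {ratio(n):.3g}")
```

The ratio behaves like $\log^4(3n) / (\sqrt{n} \log(2n))$, which tends to zero as $n$ grows.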


1.1 Our Results 


We give an algorithm for noisy tensor completion that works for third-order tensors. Let $T$ be a third-order
$n_1 \times n_2 \times n_3$ tensor that is entry-wise close to being low rank. In particular let

$$T = \sum_{\ell=1}^{r} \sigma_\ell\, a_\ell \otimes b_\ell \otimes c_\ell + \Delta \qquad (1)$$

where $\sigma_\ell$ is a scalar and $a_\ell$, $b_\ell$ and $c_\ell$ are vectors of length $n_1$, $n_2$ and $n_3$ respectively. Here $\Delta$ is a tensor
that represents noise. Its entries can be thought of as representing model misspecification because $T$ is not
exactly low rank or noise in our observations or both. We will only make assumptions about the average
and maximum absolute value of entries in $\Delta$. The vectors $a_\ell$, $b_\ell$ and $c_\ell$ are called factors, and we will assume
that their norms are roughly $\sqrt{n_i}$ for reasons that will become clear later. Moreover we will assume that the
magnitude of each of their entries is bounded by $C$ in which case we call the vectors $C$-incoherent$^2$. (Note
that a random vector of dimension $n_i$ and norm $\sqrt{n_i}$ will be $O(\sqrt{\log n_i})$-incoherent with high probability.)
The advantage of these conventions is that a typical entry in $T$ does not become vanishingly small as we
increase the dimensions of the tensor. This will make it easier to state and interpret the error bounds of our
algorithm.

Let $\Omega$ represent the locations of the entries that we observe, which (as is standard) are chosen uniformly
at random and without replacement. Set $|\Omega| = m$. Our goal is to output a hypothesis $X$ that has small
entry-wise error, defined as:

$$\mathrm{err}(X) = \frac{1}{n_1 n_2 n_3} \sum_{i,j,k} |X_{i,j,k} - T_{i,j,k}|$$

This measures the error on both the observed and unobserved entries of T. Our goal is to give algorithms 
that achieve vanishing error, as the size of the problem increases. Moreover we will want algorithms that 
need as few observations as possible. Here and throughout let $n_1 \le n_2 \le n_3$ and $n = \max\{n_1, n_2, n_3\}$. Our
main result is: 


Theorem 1.1 (Main theorem). Suppose we are given $m$ observations whose locations are chosen uniformly
at random (and without replacement) from a tensor $T$ of the form (1) where each of the factors $a_\ell$, $b_\ell$ and
$c_\ell$ is $C$-incoherent. Let $\delta = \frac{1}{n_1 n_2 n_3} \sum_{i,j,k} |\Delta_{i,j,k}|$. And let $r^* = \sum_{\ell=1}^{r} |\sigma_\ell|$. Then there is a polynomial time
algorithm that outputs a hypothesis $X$ that with probability $1 - \epsilon$ satisfies

$$\mathrm{err}(X) \le 4 C^3 r^* \sqrt{\frac{(n_1)^{1/2} (n_2 + n_3) \log^4 n + \log 2/\epsilon}{m}} + 2\delta$$

provided that $\max_{i,j,k} |\Delta_{i,j,k}| \le \sqrt{\frac{m}{\log 2/\epsilon}}\, \delta$.

$^2$ Incoherence is often defined based on the span of the factors, but we will allow the number of factors to be larger than any
of the dimensions of the tensor so we will need an alternative way to ensure that the non-zero entries of the factors are spread
out.


Since the error bound above is quite involved, let us dissect the terms in it. In fact, having an additive
$\delta$ in the error bound is unavoidable. We have not assumed anything about $\Delta$ in (1) except a bound on
the average and maximum magnitude of its entries. If $\Delta$ were a random tensor whose entries are $+\delta$ and
$-\delta$ then no matter how many entries of $T$ we observe, we cannot hope to obtain error less than $\delta$ on the
unobserved entries$^3$. The crucial point is that the remaining term in the error bound becomes $o(1)$ when
$m = \widetilde{\Omega}((r^*)^2 n^{3/2})$ which for polylogarithmic $r^*$ improves over the naive algorithm for tensor completion
by a polynomial factor in terms of the number of observations. Moreover our algorithm works without any
constraints that factors $a_\ell$, $b_\ell$ and $c_\ell$ be orthogonal or even have low inner-product.

In non-degenerate cases we can even remove another factor of r* from the number of observations we 
need. Suppose that $T$ is a tensor as in (1), but let $\sigma_\ell$ be Gaussian random variables with mean zero and
variance one. The factors $a_\ell$, $b_\ell$ and $c_\ell$ are still fixed, but because of the randomness in the coefficients $\sigma_\ell$,
the entries of $T$ are now random variables.


Corollary 1.2. Suppose we are given $m$ observations whose locations are chosen uniformly at random (and
without replacement) from a tensor $T$ of the form (1), where each coefficient $\sigma_\ell$ is a Gaussian random
variable with mean zero and variance one, and each of the factors $a_\ell$, $b_\ell$ and $c_\ell$ is $C$-incoherent.

Further, suppose that for a $1 - o(1)$ fraction of the entries of $T$, we have $\mathrm{var}(T_{i,j,k}) \ge r/\mathrm{polylog}(n) = V$
and that $\Delta$ is a tensor where each entry is a Gaussian with mean zero and variance $o(V)$. Then there is a
polynomial time algorithm that outputs a hypothesis $X$ that satisfies

$$X_{i,j,k} = (1 \pm o(1))\, T_{i,j,k}$$

for a $1 - o(1)$ fraction of the entries. The algorithm succeeds with probability at least $1 - o(1)$ over the
randomness of the locations of the observations, and the realizations of the random variables $\sigma_\ell$ and the
entries of $\Delta$. Moreover the algorithm uses $m = C^6 n^{3/2} r\, \mathrm{polylog}(n)$ observations.

In the setting above, it is enough that the coefficients $\sigma_\ell$ are random and that the non-zero entries in the
factors are spread out to ensure that the typical entry in $T$ has variance about $r$. Consequently, the typical
entry in $T$ is about $\sqrt{r}$. This fact combined with the error bounds in Theorem 1.1 immediately yields the
above corollary. Remarkably, the guarantee is interesting even when $r = n^{3/2-\epsilon}$ (the so-called overcomplete
case). In this setting, if we observe a subpolynomial fraction of the entries of T we are able to recover almost 
all of the remaining entries almost entirely, even though there are no known algorithms for decomposing 
an overcomplete, third-order tensor even if we are given all of its entries, at least without imposing much 
stronger conditions that the factors be nearly orthogonal [36]. 

We believe that this work is a natural first step in designing practically efficient algorithms for tensor 
completion. Our algorithms manage to leverage the structure across the slices through the tensor, instead 
of treating each slice as an independent matrix completion problem. Now that we know this is possible, 
a natural follow-up question is to get more efficient algorithms. Our algorithms are based on the sixth 
level of the sum-of-squares hierarchy and run in polynomial time, but are quite far from being practically 
efficient as stated. Recent work of Hopkins et al. [44] shows how to speed up sum-of-squares and obtain 
nearly linear time algorithms for a number of problems where the only previously known algorithms ran 
in a prohibitively large degree polynomial running time. Another approach would be to obtain similar 
guarantees for alternating minimization. Currently, the only known approaches [48] require that the factors 
are orthonormal and only work in the undercomplete case. Finally, it would be interesting to get algorithms 
that recover a low rank tensor exactly when there is no noise. 


1.2 Our approach 

All of our algorithms are based on solving the following optimization problem:

$$\min \|X\|_K \quad \text{s.t.} \quad \frac{1}{m} \sum_{(i,j,k) \in \Omega} |X_{i,j,k} - T_{i,j,k}| \le 2\delta \qquad (2)$$

and outputting the minimizer $X$, where $\|\cdot\|_K$ is some norm that can be computed in polynomial time. It will
be clear from the way we define the norm that the low rank part of $T$ will itself be a good candidate solution.
But this is not necessarily the solution that the convex program finds. How do we know that whatever it
finds not only has low entry-wise error on the observed entries of $T$, but also on the unobserved entries too?

$^3$ The factor of 2 is not important, and comes from needing a bound on the empirical error of how well the low rank part of
$T$ itself agrees with our observations so far. We could replace it with any other constant factor that is larger than 1.

This is a well-studied topic in statistical learning theory, and as is standard we can use the notion of 
Rademacher complexity as a tool to bound the error. The Rademacher complexity is a property of the 
norm we choose, and our main innovation is to use the sum-of-squares hierarchy to suggest a suitable norm. 
Our results are based on establishing a connection between noisy tensor completion and refuting random 
constraint satisfaction problems. Moreover, our analysis follows by embedding algorithms for refutation 
within the sum-of-squares hierarchy as a method to bound the Rademacher complexity. 

A natural question to ask is: Are there other norms that have even better Rademacher complexity than 
the ones we use here, and that are still computable in polynomial time? It turns out that any such norm 
would immediately lead to much better algorithms for refuting random constraint satisfaction problems than 
we currently know. We have not yet introduced Rademacher complexity, so we state our lower bounds
informally: 

Theorem 1.3 (informal). For any $\epsilon > 0$, if there is a polynomial time algorithm that achieves error

$$\mathrm{err}(X) \le r^* \sqrt{\frac{n^{3/2-\epsilon}}{m}}$$

through the framework of Rademacher complexity then there is an efficient algorithm for refuting a random
3-SAT formula on $n$ variables with $m = n^{3/2-\epsilon}$ clauses. Moreover the natural sum-of-squares relaxation
requires at least $n^{2\epsilon}$ levels in order to achieve the above error (again through the framework of Rademacher
complexity).

These results follow directly from the works of Grigoriev [38], Schoenebeck [68] and Feige [29]. There are 
similar connections between our upper bounds and the work of Coja-Oghlan, Goerdt and Lanka [25] who 
give an algorithm for strongly refuting random 3-SAT. In Section 2 we explain some preliminary connections 
between these fields, at which point we will be in a better position to explain how we can borrow tools from 
one area to address open questions in another. We state this theorem more precisely in Corollary 2.13 and 
Corollary 5.6, which provide both conditional and unconditional lower bounds that match our upper bounds. 


1.3 Computational vs. Sample Complexity Tradeoffs 

It is interesting to compare the story of matrix completion and tensor completion. In matrix completion, we 
have the best of both worlds: There are efficient algorithms which work when the number of observations 
is close to the information theoretic minimum. In tensor completion, we gave algorithms that improve 
upon the number of observations needed by a polynomial factor but still require a polynomial factor more 
observations than can be achieved if we ignore computational considerations. We believe that for many other 
linear inverse problems (e.g. sparse phase retrieval), there may well be gaps between what can be achieved 
information theoretically and what can be achieved with computationally efficient estimators. Moreover, 
proving lower bounds against the sum-of-squares hierarchy offers a new type of evidence that problems are 
hard, that does not rely on reductions from other average-case hard problems which seem (in general) to 
be brittle and difficult to execute while preserving the naturalness of the input distribution. In fact, even 
when there are such reductions [12], the sum-of-squares hierarchy offers a methodology to make sharper 
predictions for questions like: Is there a quasi-polynomial time algorithm for sparse PCA, or does it require 
exponential time? 


Organization 

In Section 2 we introduce Rademacher complexity, the tensor nuclear norm and strong refutation. We 
connect these concepts by showing that any norm that can be computed in polynomial time and has good 
Rademacher complexity yields an algorithm for strongly refuting random 3-SAT. In Section 3 we show 
how a particular algorithm for strong refutation can be embedded into the sum-of-squares hierarchy and 
directly leads to a norm that can be computed in polynomial time and has good Rademacher complexity. 
In Section 4 we establish certain spectral bounds that we need, and prove our main upper bounds. In
Section 5 we prove lower bounds on the Rademacher complexity of the sequence of norms arising from the 
sum-of-squares hierarchy by a direct reduction to lower bounds for refuting random 3-XOR. In Appendix A 
we give a reduction from noisy tensor completion on asymmetric tensors to symmetric tensors. This is what
allows us to extend our analysis to arbitrary order $d$ tensors. The proofs are essentially identical to those
in the $d = 3$ case, only more notationally involved, so we omit them.

2 Noisy Tensor Completion and Refutation 

Here we make the connection between noisy tensor completion and strong refutation explicit. Our first step 
is to formulate a problem that is a special case of both, and studying it will help us clarify how notions from 
one problem translate to the other. 

2.1 The Distinguishing Problem 

Here we introduce a problem that we call the distinguishing problem. We are given random observations 
from a tensor and promised that the underlying tensor fits into one of the two following categories. We want 
an algorithm that can tell which case the samples came from, and succeeds using as few observations as 
possible. The two cases are: 

1. Each observation is chosen uniformly at random (and without replacement) from a tensor $T$ where
independently for each entry we set

$$T_{i,j,k} = \begin{cases} a_i a_j a_k & \text{with probability } 7/8 \\ 1 & \text{with probability } 1/16 \\ -1 & \text{else} \end{cases}$$

where $a$ is a vector whose entries are $\pm 1$.

2. Alternatively, each observation is chosen uniformly at random (and without replacement) from a tensor
$T$ each of whose entries is independently set to either $+1$ or $-1$ and with equal probability.

In the first case, the entries of the underlying tensor T are predictable. It is possible to guess a 15/16 fraction 
of them correctly, once we have observed enough of its entries to be able to deduce a. And in the second 
case, the entries of T are completely unpredictable because no matter how many entries we have observed, 
the remaining entries are still random. Thus we cannot predict any of the unobserved entries better than 
random guessing. 
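The two cases can be simulated directly for a small tensor. The sketch below is illustrative only: it stores the full tensor in a dictionary (feasible only for tiny $n$), and the function names are not from the paper. It checks that in the planted case the guess $a_i a_j a_k$ matches about a $15/16$ fraction of the entries (the $7/8$ from the first branch plus the half of the remaining $1/8$ that happens to agree), while in the random case it matches about half.

```python
import itertools
import random


def planted_tensor(a, rng):
    """Case 1: T[i,j,k] = a_i a_j a_k w.p. 7/8, +1 w.p. 1/16, -1 otherwise."""
    n = len(a)
    T = {}
    for i, j, k in itertools.product(range(n), repeat=3):
        u = rng.random()
        if u < 7 / 8:
            T[i, j, k] = a[i] * a[j] * a[k]
        elif u < 15 / 16:
            T[i, j, k] = 1
        else:
            T[i, j, k] = -1
    return T


def random_tensor(n, rng):
    """Case 2: every entry is an independent uniform sign."""
    return {loc: rng.choice((-1, 1))
            for loc in itertools.product(range(n), repeat=3)}


def agreement(a, T):
    """Fraction of entries of T that equal the rank-one guess a_i a_j a_k."""
    hits = sum(1 for (i, j, k), t in T.items() if t == a[i] * a[j] * a[k])
    return hits / len(T)


rng = random.Random(0)
n = 20
a = [rng.choice((-1, 1)) for _ in range(n)]
print("planted:", agreement(a, planted_tensor(a, rng)))
print("random: ", agreement(a, random_tensor(n, rng)))
```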

Now we will explain how the distinguishing problem can be equivalently reformulated in the language of 
refutation. We give a formal definition for strong refutation later (Definition 2.10), but for the time being 
we can think of it as the task of (given an instance of a constraint satisfaction problem) certifying that there 
is no assignment that satisfies many of the clauses. We will be interested in 3-XOR formulas, where there 
are $n$ variables $v_1, v_2, \ldots, v_n$ that are constrained to take on values $+1$ or $-1$. Each clause takes the form

$$v_i \cdot v_j \cdot v_k = T_{i,j,k}$$

where the right hand side is either $+1$ or $-1$. The clause represents a parity constraint but over the domain
$\{+1, -1\}$ instead of over the usual domain $\mathbb{F}_2$. We have chosen the notation suggestively so that it hints at
the mapping between the two views of the problem. Each observation maps to a clause $v_i \cdot v_j \cdot v_k = T_{i,j,k}$

and vice-versa. Thus an equivalent way to formulate the distinguishing problem is that we are given a 3-XOR 
formula which was generated in one of the following two ways: 

1. Each clause in the formula is generated by choosing an ordered triple of variables $(v_i, v_j, v_k)$ uniformly
at random (and without replacement) and we set

$$v_i \cdot v_j \cdot v_k = \begin{cases} a_i a_j a_k & \text{with probability } 7/8 \\ 1 & \text{with probability } 1/16 \\ -1 & \text{else} \end{cases}$$

where $a$ is a vector whose entries are $\pm 1$. Now $a$ represents a planted solution and by design our
sampling procedure guarantees that many of the clauses that are generated are consistent with it.

2. Alternatively, each clause in the formula is generated by choosing an ordered triple of variables
$(v_i, v_j, v_k)$ uniformly at random (and without replacement) and we set $v_i \cdot v_j \cdot v_k = T_{i,j,k}$ where $T_{i,j,k}$ is
a random variable that takes on values $+1$ and $-1$.

In the first case, the 3-XOR formula has an assignment that satisfies a $15/16$ fraction of the clauses in
expectation by setting $v_i = a_i$. In the second case, any fixed assignment satisfies at most half of the clauses
in expectation. Moreover if we are given $\Omega(n \log n)$ clauses, it is easy to see by applying the Chernoff bound
and taking a union bound over all possible assignments that with high probability there is no assignment
that satisfies more than a $1/2 + o(1)$ fraction of the clauses.
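The Chernoff-plus-union-bound step can be checked numerically. The sketch below assumes the Hoeffding form of the Chernoff bound: for a fixed assignment and $m$ clauses with independent uniformly random right hand sides, the probability of satisfying more than a $1/2 + \epsilon$ fraction is at most $\exp(-2\epsilon^2 m)$. Multiplying by the $2^n$ assignments gives the union bound. The function names and the specific choice of inequality are assumptions made for illustration.

```python
import math


def log_union_bound(n, m, eps):
    """Natural log of 2^n * exp(-2 eps^2 m).

    A negative value means the union bound on the probability that some
    assignment satisfies more than a (1/2 + eps) fraction is below 1.
    """
    return n * math.log(2) - 2 * eps**2 * m


def min_clauses(n, eps):
    """Smallest m at which the union bound above drops to 1 or below."""
    return math.ceil(n * math.log(2) / (2 * eps**2))


n, eps = 1000, 0.1
m0 = min_clauses(n, eps)
print("clauses needed for eps = 0.1:", m0)
print("log bound at 10 n log n clauses:",
      log_union_bound(n, int(10 * n * math.log(n)), eps))
```

For constant $\epsilon$ a linear number of clauses already suffices under this bound, and with $\Omega(n \log n)$ clauses one can take $\epsilon = o(1)$, matching the $1/2 + o(1)$ in the text.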

This will be the starting point for the connections we establish between noisy tensor completion and 
refutation. Even in the matrix case these connections seem to have gone unnoticed: the same spectral
bounds that are used to analyze the Rademacher complexity of the nuclear norm [71] are also used to refute
random 2-SAT formulas [37], and this is no accident.


2.2 Rademacher Complexity 


Ultimately our goal is to show that the hypothesis X that our convex program finds is entry-wise close to 
the unknown tensor T. By virtue of the fact that X is a feasible solution to (2) we know that it is entry-wise 
close to T on the observed entries. This is often called the empirical error: 

Definition 2.1. For a hypothesis $X$, the empirical error is

$$\text{emp-err}(X) = \frac{1}{m} \sum_{(i,j,k) \in \Omega} |X_{i,j,k} - T_{i,j,k}|$$
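As a minimal illustration of the two error measures, the sketch below computes err and emp-err for tensors stored as dictionaries keyed by $(i, j, k)$. The storage format, the function names and the toy example are assumptions chosen for readability, not part of the paper.

```python
import itertools
import random


def err(X, T):
    """Average |X - T| over every entry, observed or not."""
    return sum(abs(X[loc] - T[loc]) for loc in T) / len(T)


def emp_err(X, T, omega):
    """Average |X - T| over the observed locations in omega only."""
    return sum(abs(X[loc] - T[loc]) for loc in omega) / len(omega)


# Toy 3 x 3 x 3 example: T is a rank-one sign tensor, X is wrong on one entry.
rng = random.Random(0)
cells = list(itertools.product(range(3), repeat=3))
a = [rng.choice((-1, 1)) for _ in range(3)]
T = {(i, j, k): a[i] * a[j] * a[k] for (i, j, k) in cells}
X = dict(T)
X[(0, 0, 0)] = -T[(0, 0, 0)]      # a single mistake of magnitude 2
omega = rng.sample(cells, 9)      # m = 9 locations, without replacement
print(err(X, T), emp_err(X, T, omega))
```

Whether the empirical error sees the single mistake depends on whether $(0,0,0)$ lands in $\Omega$, which is exactly the gap that the generalization error controls.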


Recall that $\mathrm{err}(X)$ is the average entry-wise error between $X$ and $T$, over all (observed and unobserved)
entries. Also recall that among the candidate $X$'s that have low empirical error, the convex program finds
the one that minimizes $\|X\|_K$ for some polynomial time computable norm. The way we will choose the norm
$\|\cdot\|_K$ and our bound on the maximum magnitude of an entry of $\Delta$ will guarantee that the low rank part
of $T$ will with high probability be a feasible solution. This ensures that $\|X\|_K$ for the $X$ we find is not too
large either. One way to bound $\mathrm{err}(X)$ is to show that no hypothesis in the unit norm ball can have too
large a gap between its error and its empirical error (and then dilate the unit norm ball so that it contains
$X$). With this in mind, we define:

Definition 2.2. For a norm $\|\cdot\|_K$ and a set $\Omega$ of observations, the generalization error is

$$\sup_{\|X\|_K \le 1} \big| \mathrm{err}(X) - \text{emp-err}(X) \big|$$


It turns out that one can bound the generalization error via the Rademacher complexity. 

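Definition 2.3. Let $\Omega = \{(i_1, j_1, k_1), (i_2, j_2, k_2), \ldots, (i_m, j_m, k_m)\}$ be a set of $m$ locations chosen uniformly
at random (and without replacement) from $[n_1] \times [n_2] \times [n_3]$. And let $\sigma_1, \sigma_2, \ldots, \sigma_m$ be random $\pm 1$ variables.
The Rademacher complexity of (the unit ball of) the norm $\|\cdot\|_K$ is defined as

$$R^m(\|\cdot\|_K) = \mathbb{E}_{\Omega, \sigma} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell X_{i_\ell, j_\ell, k_\ell} \Big| \Big]$$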


It follows from a standard symmetrization argument from empirical process theory [51, 11] that the 
Rademacher complexity does indeed bound the generalization error. 

Theorem 2.4. Let $\epsilon \in (0, 1)$ and suppose each $X$ with $\|X\|_K \le 1$ has bounded loss, i.e. $|X_{i,j,k} - T_{i,j,k}| \le a$,
and that locations $(i, j, k)$ are chosen uniformly at random and without replacement. Then with probability
at least $1 - \epsilon$, for every $X$ with $\|X\|_K \le 1$, we have

$$\mathrm{err}(X) \le \text{emp-err}(X) + 2 R^m(\|\cdot\|_K) + 2a \sqrt{\frac{\ln(1/\epsilon)}{m}}$$
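We repeat the proof here following [11] for the sake of completeness but readers familiar with Rademacher
complexity can feel free to skip ahead to Definition 2.5. The main idea is to let $\Omega'$ be an independent set of
$m$ samples from the same distribution, again without replacement. The expected generalization error is:

$$\mathbb{E}_{\Omega} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} |X_{i_\ell,j_\ell,k_\ell} - T_{i_\ell,j_\ell,k_\ell}| - \mathbb{E}_{i,j,k} \big[ |X_{i,j,k} - T_{i,j,k}| \big] \Big| \Big] \qquad (*)$$

Then we can write

$$\begin{aligned}
(*) &= \mathbb{E}_{\Omega} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} |X_{i_\ell,j_\ell,k_\ell} - T_{i_\ell,j_\ell,k_\ell}| - \mathbb{E}_{\Omega'} \Big[ \frac{1}{m} \sum_{\ell=1}^{m} |X_{i'_\ell,j'_\ell,k'_\ell} - T_{i'_\ell,j'_\ell,k'_\ell}| \Big] \Big| \Big] \\
&\le \mathbb{E}_{\Omega, \Omega'} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} \Big( |X_{i_\ell,j_\ell,k_\ell} - T_{i_\ell,j_\ell,k_\ell}| - |X_{i'_\ell,j'_\ell,k'_\ell} - T_{i'_\ell,j'_\ell,k'_\ell}| \Big) \Big| \Big]
\end{aligned}$$

where the last line follows by the convexity of $\sup(\cdot)$ (Jensen's inequality). Now we can use the Rademacher (random $\pm 1$) variables
$\{\sigma_\ell\}_\ell$ and rewrite the right hand side of the above expression as follows:

$$\begin{aligned}
(*) &\le \mathbb{E}_{\Omega, \Omega', \sigma} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell \Big( |X_{i_\ell,j_\ell,k_\ell} - T_{i_\ell,j_\ell,k_\ell}| - |X_{i'_\ell,j'_\ell,k'_\ell} - T_{i'_\ell,j'_\ell,k'_\ell}| \Big) \Big| \Big] \\
&\le \mathbb{E}_{\Omega, \Omega', \sigma} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell |X_{i_\ell,j_\ell,k_\ell} - T_{i_\ell,j_\ell,k_\ell}| \Big| + \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell |X_{i'_\ell,j'_\ell,k'_\ell} - T_{i'_\ell,j'_\ell,k'_\ell}| \Big| \Big] \\
&\le 2\, \mathbb{E}_{\Omega, \sigma} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell |X_{i_\ell,j_\ell,k_\ell} - T_{i_\ell,j_\ell,k_\ell}| \Big| \Big] \\
&\le 2\, \mathbb{E}_{\Omega, \sigma} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell \big( |X_{i_\ell,j_\ell,k_\ell}| + |T_{i_\ell,j_\ell,k_\ell}| \big) \Big| \Big] \\
&\le 2\, \mathbb{E}_{\Omega, \sigma} \Big[ \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell |T_{i_\ell,j_\ell,k_\ell}| \Big| \Big] + 2\, \mathbb{E}_{\Omega, \sigma} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell |X_{i_\ell,j_\ell,k_\ell}| \Big| \Big] \\
&= 2\, \mathbb{E}_{\Omega, \sigma} \Big[ \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell T_{i_\ell,j_\ell,k_\ell} \Big| \Big] + 2\, \mathbb{E}_{\Omega, \sigma} \Big[ \sup_{\|X\|_K \le 1} \Big| \frac{1}{m} \sum_{\ell=1}^{m} \sigma_\ell X_{i_\ell,j_\ell,k_\ell} \Big| \Big]
\end{aligned}$$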

where the second, fourth and fifth inequalities use the triangle inequality. The equality uses the fact that the
$\sigma_\ell$'s are random signs and hence can absorb the absolute value around the terms that they multiply. The
second term above in the last expression is exactly the Rademacher complexity that we defined earlier. This
argument only shows that the Rademacher complexity bounds the expected generalization error. However
it turns out that we can also use the Rademacher complexity to bound the generalization error with high
probability by applying McDiarmid's inequality. See for example [5]. We also remark that generalization
bounds are often stated in the setting where samples are drawn i.i.d., but here the locations of our observations
are sampled without replacement. Nevertheless for the settings of $m$ we are interested in, the fraction of
our observations that are repeats is $o(1)$ (in fact it is subpolynomial) and we can move back and forth
between both sampling models at negligible loss in our bounds.

In much of what follows it will be convenient to think of $\Omega = \{(i_1, j_1, k_1), (i_2, j_2, k_2), \ldots, (i_m, j_m, k_m)\}$ and
$\{\sigma_\ell\}_\ell$ as being represented by a sparse tensor $Z$, defined below.

Definition 2.5. Let $Z$ be an $n_1 \times n_2 \times n_3$ tensor such that

$$Z_{i,j,k} = \begin{cases} 0, & \text{if } (i,j,k) \notin \Omega \\ \sum_{\ell \text{ s.t. } (i,j,k) = (i_\ell, j_\ell, k_\ell)} \sigma_\ell, & \text{otherwise} \end{cases}$$

This definition greatly simplifies our notation. In particular we have

$$\sum_{\ell=1}^{m} \sigma_\ell X_{i_\ell, j_\ell, k_\ell} = \sum_{i,j,k} Z_{i,j,k} X_{i,j,k} = \langle Z, X \rangle$$

where we have introduced the notation $\langle \cdot, \cdot \rangle$ to denote the natural inner-product between tensors. Our
main technical goal in this paper will be to analyze the Rademacher complexity of a sequence of successively 
tighter norms that we get from the sum-of-squares hierarchy, and to derive implications for noisy tensor 
completion and for refutation from these bounds. 
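The identity between the signed sum over observations and the inner product with $Z$ can be checked on a toy instance. In the sketch below tensors are stored as dictionaries keyed by $(i, j, k)$, and the helper names are illustrative assumptions. A repeated location is included on purpose to show that its signs are summed, as in Definition 2.5.

```python
import itertools
from collections import defaultdict


def sparse_sign_tensor(locations, signs):
    """Build Z: each observed location holds the sum of its signs."""
    Z = defaultdict(int)
    for loc, s in zip(locations, signs):
        Z[loc] += s
    return dict(Z)


def inner_product(Z, X):
    """<Z, X>, summing only over the support of the sparse tensor Z."""
    return sum(z * X[loc] for loc, z in Z.items())


# A 2 x 2 x 2 integer tensor X and three observations, one location repeated.
X = {(i, j, k): i + 2 * j + 3 * k + 1
     for i, j, k in itertools.product(range(2), repeat=3)}
locations = [(0, 0, 0), (0, 1, 1), (0, 0, 0)]
signs = [1, -1, 1]

direct = sum(s * X[loc] for loc, s in zip(locations, signs))
Z = sparse_sign_tensor(locations, signs)
print(direct, inner_product(Z, X))
```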


2.3 The Tensor Nuclear Norm 


Here we introduce the tensor nuclear norm and analyze its Rademacher complexity. Many works have 
suggested using it to solve tensor completion problems [56, 70, 74]. This suggestion is quite natural given 
that it is based on a similar guiding principle as that which led to $\ell_1$-minimization in compressed sensing
and the nuclear norm in matrix completion [28]. More generally, one can define the atomic norm for a wide
range of linear inverse problems [23], and the $\ell_1$-norm, the nuclear norm and the tensor nuclear norm are all
special cases of this paradigm. Before we proceed, let us first formally define the notion of incoherence that
we gave in the introduction.

Definition 2.6. A length $n_i$ vector $a$ is $C$-incoherent if $\|a\| = \sqrt{n_i}$ and $\|a\|_\infty \le C$.

Recall that we chose to work with vectors whose typical entry is a constant so that the entries in $T$ do
not become vanishingly small as the dimensions of the tensor increase. We can now define the tensor nuclear
norm$^4$:


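Definition 2.7 (tensor nuclear norm). Let $\mathcal{A} \subseteq \mathbb{R}^{n_1 \times n_2 \times n_3}$ be defined as

$$\mathcal{A} = \Big\{ X \ \text{s.t.}\ \exists \text{ distribution } \mu \text{ on triples of } C\text{-incoherent vectors with } X_{i,j,k} = \mathbb{E}_{(a,b,c) \leftarrow \mu} [a_i b_j c_k] \Big\}$$

The tensor nuclear norm of $X$, which is denoted by $\|X\|_{\mathcal{A}}$, is the infimum over $\alpha$ such that $X/\alpha \in \mathcal{A}$.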

In particular $\|T - \Delta\|_{\mathcal{A}} \le r^*$. Finally we give an elementary bound on the Rademacher complexity of the
tensor nuclear norm. Recall that $n = \max(n_1, n_2, n_3)$.

Lemma 2.8. $R^m(\|\cdot\|_{\mathcal{A}}) = O\big(C^3 \sqrt{n/m}\big)$

Proof. Recall the definition of Z given in Definition 2.5. With this we can write 


E 

Q,a 


m 


sup / j 

= E 

sup \{Z,a®b®c)\ 

\mu<i 1 e=1 


C-incoherent a,b,c 


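We can now adapt the discretization approach in [33], although our task is considerably simpler because
we are constrained to $C$-incoherent $a$'s. In particular, let

$$S = \big\{ a \ \big|\ a \text{ is } C\text{-incoherent and } a \in (\epsilon \mathbb{Z})^{n} \big\}$$

By standard bounds on the size of an $\epsilon$-net [58], we get that $|S| \le O(C/\epsilon)^n$. Suppose that $|\langle Z, a \otimes b \otimes c\rangle| \le M$
for all $a, b, c \in S$. Then for an arbitrary, but $C$-incoherent $a$ we can expand it as $a = \sum_i \epsilon^i a^i$ where each
$a^i \in S$ and similarly for $b$ and $c$. And now

$$|\langle Z, a \otimes b \otimes c\rangle| \le \sum_i \sum_j \sum_k \epsilon^i \epsilon^j \epsilon^k \,|\langle Z, a^i \otimes b^j \otimes c^k\rangle| \le (1-\epsilon)^{-3} M$$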


Moreover since each entry in $a \otimes b \otimes c$ has magnitude at most $C^3$ we can apply a Chernoff bound to conclude
that for any particular $a, b, c \in S$ we have

$$|\langle Z, a \otimes b \otimes c\rangle| \le O\big(C^3 \sqrt{m \log 1/\gamma}\big)$$

with probability at least $1 - \gamma$. Finally, if we set $\gamma = (\epsilon/C)^{n}$ and we set $\epsilon = 1/2$ we get that

$$R^m(\mathcal{A}) \le \frac{(1-\epsilon)^{-3}}{m} \max_{a,b,c \in S} |\langle Z, a \otimes b \otimes c\rangle| = O\Big(C^3 \sqrt{\frac{n}{m}}\Big)$$

and this completes the proof. □

⁴ The usual definition of the tensor nuclear norm has no constraints that the vectors $a$, $b$ and $c$ be $C$-incoherent. However,
adding this additional requirement only serves to further restrict the unit norm ball, while ensuring that the low rank part of $T$
(when scaled down) is still in it, since the factors of $T$ are anyways assumed to be $C$-incoherent. This makes it easier to prove
recovery guarantees because we do not need to worry about sparse vectors behaving very differently than incoherent ones, and
since we are not going to compute this norm anyways this modification will make our analysis easier.

The important point is that the Rademacher complexity of the tensor nuclear norm is $o(1)$ whenever
$m = \omega(n)$. In the next subsection we will connect this to refutation in a way that allows us to strengthen
known hardness results for computing the tensor nuclear norm [39, 41] and show that it is even hard to
compute in an average-case sense based on some standard conjectures about the difficulty of refuting random
3-SAT.


2.4 From Rademacher Complexity to Refutation 

Here we show the first implication of the connection we have established. Any norm that can be computed in 
polynomial time and has good Rademacher complexity immediately yields an algorithm for strongly refuting 
random 3-SAT and 3-XOR formulas. Next let us finally define strong refutation. 

Definition 2.9. For a formula $\phi$, let $\mathrm{opt}(\phi)$ be the largest fraction of clauses that can be satisfied by any
assignment.

In what follows, we will use the term random 3-XOR formula to refer to a formula where each clause is
generated by choosing an ordered triple of variables $(v_i, v_j, v_k)$ uniformly at random (and without replacement)
and setting $v_i \cdot v_j \cdot v_k = z$ where $z$ is a random variable that takes on values $+1$ and $-1$.

Definition 2.10. An algorithm for strongly refuting random 3-XOR takes as input a 3-XOR formula $\phi$ and
outputs a quantity $\mathrm{alg}(\phi)$ that satisfies

1. For any 3-XOR formula $\phi$, $\mathrm{opt}(\phi) \le \mathrm{alg}(\phi)$

2. If $\phi$ is a random 3-XOR formula with $m$ clauses, then with high probability $\mathrm{alg}(\phi) = 1/2 + o(1)$

This definition only makes sense when $m$ is large enough so that $\mathrm{opt}(\phi) = 1/2 + o(1)$ holds with high
probability, which happens when $m = \omega(n)$. The goal is to design algorithms that use as few clauses as
possible, and are able to certify that a random formula is indeed far from satisfiable (without underestimating
the fraction of clauses that can be satisfied) and to do so as close as possible to the information theoretic
threshold.

Now let us use a polynomial time computable norm $\|\cdot\|_{\mathcal{K}}$ that has good Rademacher complexity to
give an algorithm for strongly refuting random 3-XOR. As in Section 2.1, given a formula $\phi$ we map its $m$
clauses to a collection of $m$ observations according to the usual rule: If there are $n$ variables, we construct
an $n \times n \times n$ tensor $Z$ where for each clause of the form $v_i \cdot v_j \cdot v_k = z_{i,j,k}$ we put the entry $z_{i,j,k}$ at location
$(i, j, k)$. All the rest of the entries in $Z$ are set to zero. We solve the following optimization problem:

$$\max \eta \ \ \text{s.t.}\ \ \exists X \text{ with } \|X\|_{\mathcal{K}} \le 1 \text{ and } \frac{1}{m}\langle Z, X\rangle \ge 2\eta \qquad (3)$$

Let $\eta^*$ be the optimum value. We set $\mathrm{alg}(\phi) = 1/2 + \eta^*$. What remains is to prove that the output of this
algorithm solves the strong refutation problem for 3-XOR.

Theorem 2.11. Suppose that $\|\cdot\|_{\mathcal{K}}$ is computable in polynomial time and satisfies $\|X\|_{\mathcal{K}} \le 1$ whenever
$X = a \otimes a \otimes a$ and $a$ is a vector with $\pm 1$ entries. Further suppose that for any $X$ with $\|X\|_{\mathcal{K}} \le 1$ its entries
are bounded by $C^3$ in absolute value. Then (3) can be solved in polynomial time and if $R^m(\|\cdot\|_{\mathcal{K}}) = o(1)$
then setting $\mathrm{alg}(\phi) = 1/2 + \eta^*$ solves strong refutation for 3-XOR with $O(C^6 m \log n)$ clauses.

Proof. The key observation is the following inequality which relates (3) to $\mathrm{opt}(\phi)$.

$$2\,\mathrm{opt}(\phi) - 1 \le \frac{1}{m} \sup_{\|X\|_{\mathcal{K}} \le 1} \langle Z, X\rangle$$

To establish this inequality, let $v_1, v_2, \ldots, v_n$ be the assignment that maximizes the fraction of clauses satisfied.
If we set $a_i = v_i$ and $X = a \otimes a \otimes a$ we have that $\|X\|_{\mathcal{K}} \le 1$ by assumption. Thus $X$ is a feasible solution.
Now with this choice of $X$ for the right hand side, every term in the sum that corresponds to a satisfied
clause contributes $+1$ and every term that corresponds to an unsatisfied clause contributes $-1$. We get
$2\,\mathrm{opt}(\phi) - 1$ for this choice of $X$, and this completes the proof of the inequality above.
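The counting step in this argument can be checked on a toy instance. The sketch below is only an illustration under stated assumptions: the helper names (`build_z`, `inner_with_cube`, `satisfied_fraction`) and the four-clause formula are hypothetical, and it assumes no ordered triple appears in two clauses. It stores $Z$ sparsely and compares $\frac{1}{m}\langle Z, a \otimes a \otimes a\rangle$ with twice the satisfied fraction minus one.

```python
def build_z(clauses):
    """Sparse tensor Z as a dict: the clause v_i * v_j * v_k = z stores z at (i, j, k).
    Assumes (for this sketch) that no ordered triple appears in two clauses."""
    return {(i, j, k): z for (i, j, k, z) in clauses}

def inner_with_cube(Z, a):
    """Inner product <Z, a (x) a (x) a> for a sparse Z."""
    return sum(z * a[i] * a[j] * a[k] for (i, j, k), z in Z.items())

def satisfied_fraction(clauses, a):
    """Fraction of clauses v_i * v_j * v_k = z satisfied by the +/-1 assignment a."""
    good = sum(1 for (i, j, k, z) in clauses if a[i] * a[j] * a[k] == z)
    return good / len(clauses)

# A hypothetical 3-XOR formula on 4 variables, each clause given as (i, j, k, z).
clauses = [(0, 1, 2, 1), (1, 2, 3, -1), (0, 2, 3, 1), (3, 1, 0, -1)]
a = [1, -1, 1, 1]
Z = build_z(clauses)
m = len(clauses)
lhs = inner_with_cube(Z, a) / m
rhs = 2 * satisfied_fraction(clauses, a) - 1
```

Each satisfied clause contributes $+1$ to the inner product and each unsatisfied clause contributes $-1$, so the two quantities agree for every $\pm 1$ assignment, not only the optimal one.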

The crucial point is that the expectation of the right hand side over $\Omega$ and $\sigma$ is exactly the Rademacher
complexity. However we want a bound that holds with high probability instead of just in expectation. It
follows from McDiarmid's inequality and the fact that the entries of $Z$ and of $X$ are bounded by $1$ and by
$C^3$ in absolute value respectively that if we take $O(C^6 m \log n)$ observations the right hand side will be $o(1)$
with high probability. In this case, rearranging the inequality we have

$$\mathrm{opt}(\phi) \le 1/2 + \frac{1}{2m} \sup_{\|X\|_{\mathcal{K}} \le 1} \langle Z, X\rangle$$

The right hand side is exactly $\mathrm{alg}(\phi)$ and is $1/2 + o(1)$ with high probability, which implies that both
conditions in the definition for strong refutation hold and this completes the proof. □

We can now combine Theorem 2.11 with the bound on the Rademacher complexity of the tensor nuclear
norm given in Lemma 2.8 to conclude that if we could compute the tensor nuclear norm we would also obtain
an algorithm for strongly refuting random 3-XOR with only $m = \Omega(n \log n)$ clauses. It is not obvious but
it turns out that any algorithm for strongly refuting random 3-XOR implies one for 3-SAT. Let us define
strong refutation for 3-SAT. We will refer to any variable $v_i$ or its negation $\bar{v}_i$ as a literal. We will use the
term random 3-SAT formula to refer to a formula where each clause is generated by choosing an ordered
triple of literals $(y_i, y_j, y_k)$ uniformly at random (and without replacement) and setting $y_i \vee y_j \vee y_k = 1$.

Definition 2.12. An algorithm for strongly refuting random 3-SAT takes as input a 3-SAT formula $\phi$ and
outputs a quantity $\mathrm{alg}(\phi)$ that satisfies

1. For any 3-SAT formula $\phi$, $\mathrm{opt}(\phi) \le \mathrm{alg}(\phi)$

2. If $\phi$ is a random 3-SAT formula with $m$ clauses, then with high probability $\mathrm{alg}(\phi) = 7/8 + o(1)$

The only change from Definition 2.10 comes from the fact that for 3-SAT a random assignment satisfies a
$7/8$ fraction of the clauses in expectation. Our goal here is to certify that the largest fraction of clauses that
can be satisfied is $7/8 + o(1)$. The connection between refuting random 3-XOR and 3-SAT is often called
"Feige's XOR Trick" [29]. The first version of it was used to show that an algorithm for $\epsilon$-refuting 3-XOR
can be turned into an algorithm for $\epsilon$-refuting 3-SAT. However we will not use this notion of refutation so
for further details we refer the reader to [29]. The reduction was extended later by Coja-Oghlan, Goerdt
and Lanka [25] to strong refutation, which for us yields the following corollary:

Corollary 2.13. Suppose that $\|\cdot\|_{\mathcal{K}}$ is computable in polynomial time and satisfies $\|X\|_{\mathcal{K}} \le 1$ whenever
$X = a \otimes a \otimes a$ and $a$ is a vector with $\pm 1$ entries. Suppose further that for any $X$ with $\|X\|_{\mathcal{K}} \le 1$ its entries
are bounded by $C^3$ in absolute value and that $R^m(\|\cdot\|_{\mathcal{K}}) = o(1)$. Then there is a polynomial time algorithm
for strongly refuting a random 3-SAT formula with $O(C^6 m \log n)$ clauses.

Now we can get a better understanding of the obstacles to noisy tensor completion by connecting it to the
literature on refuting random 3-SAT. Despite a long line of work on refuting random 3-SAT [37, 32, 31, 30, 25],
there is no known polynomial time algorithm that works with $m = n^{3/2-\epsilon}$ clauses for any $\epsilon > 0$. Feige [29]
conjectured that for any constant $C$, there is no polynomial time algorithm for refuting random 3-SAT with
$m = Cn$ clauses⁵. Daniely et al. [26] conjectured that there is no polynomial time algorithm for $m = n^{3/2-\epsilon}$
for any $\epsilon > 0$. What we have shown above is that any norm that is a relaxation to the tensor nuclear
norm, can be computed in polynomial time and has Rademacher complexity $R^m(\|\cdot\|_{\mathcal{K}}) = o(1)$ for
$m = n^{3/2-\epsilon}$ would disprove the conjecture of Daniely et al. [26] and would yield much better algorithms for
refuting random 3-SAT than we currently know, despite fifteen years of work on the subject.

⁵ In Feige's paper [29] there was no need to make the conjecture any stronger because it was already strong enough for all of
the applications in inapproximability.

This leaves open an important question. While there are no known algorithms for strongly refuting
random 3-SAT with $m = n^{3/2-\epsilon}$ clauses, there are algorithms that work with roughly $m = n^{3/2}$ clauses
[25]. Do these algorithms have any implications for noisy tensor completion? We will adapt the algorithm
of Coja-Oghlan, Goerdt and Lanka [25] and embed it within the sum-of-squares hierarchy. In turn, this
will give us a norm that we can use to solve noisy tensor completion which uses a polynomial factor fewer
observations than known algorithms.

3 Using Resolution to Bound the Rademacher Complexity 

3.1 Pseudo-expectation 

Here we introduce the sum-of-squares hierarchy and will use it (at level six) to give a relaxation to the tensor 
nuclear norm. This will be the norm that we will use in proving our main upper bounds. First we introduce 
the notion of a pseudo-expectation operator from [7, 8, 10]: 

Definition 3.1 (Pseudo-expectation [7]). Let $k$ be even and let $P_k^{n'}$ denote the linear subspace of all
polynomials of degree at most $k$ on $n'$ variables. A linear operator $\tilde{E} : P_k^{n'} \to \mathbb{R}$ is called a degree $k$
pseudo-expectation operator if it satisfies the following conditions:

(1) $\tilde{E}[1] = 1$ (normalization)

(2) $\tilde{E}[P^2] \ge 0$, for any degree at most $k/2$ polynomial $P$ (nonnegativity)

Moreover suppose that $p \in P_k^{n'}$ with $\deg(p) = k'$. We say that $\tilde{E}$ satisfies the constraint $\{p = 0\}$ if $\tilde{E}[pq] = 0$
for every $q \in P_{k-k'}^{n'}$. And we say that $\tilde{E}$ satisfies the constraint $\{p \ge 0\}$ if $\tilde{E}[pq^2] \ge 0$ for every $q \in P_{\lfloor (k-k')/2 \rfloor}^{n'}$.

The rationale behind this definition is that if $\mu$ is a distribution on vectors in $\mathbb{R}^{n'}$ then the operator
$\tilde{E}[p] = E_{Y \leftarrow \mu}[p(Y)]$ is a degree $d$ pseudo-expectation operator for every $d$, i.e. it meets the conditions of
Definition 3.1. However the converse is in general not true. We are now ready to define the norm that will
be used in our upper bounds:

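Definition 3.2 ($SOS_k$ norm). We let $\mathcal{K}_k$ be the set of all $X \in \mathbb{R}^{n_1 \times n_2 \times n_3}$ such that there exists a degree
$k$ pseudo-expectation operator on $P_k^{n_1+n_2+n_3}$ satisfying the following polynomial constraints (where the
variables are the $Y_i^{(a)}$'s)

(a) $\{\sum_{i=1}^{n_1} (Y_i^{(1)})^2 = n_1\}$, $\{\sum_{i=1}^{n_2} (Y_i^{(2)})^2 = n_2\}$ and $\{\sum_{i=1}^{n_3} (Y_i^{(3)})^2 = n_3\}$

(b) $\{(Y_i^{(1)})^2 \le C^2\}$, $\{(Y_i^{(2)})^2 \le C^2\}$ and $\{(Y_i^{(3)})^2 \le C^2\}$ for all $i$ and

(c) $X_{i,j,k} = \tilde{E}[Y_i^{(1)} Y_j^{(2)} Y_k^{(3)}]$ for all $i$, $j$ and $k$.

The $SOS_k$ norm of $X \in \mathbb{R}^{n_1 \times n_2 \times n_3}$, which is denoted by $\|X\|_{\mathcal{K}_k}$, is the infimum over $\alpha$ such that $X/\alpha \in \mathcal{K}_k$.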

The constraints in Definition 3.1 can be expressed as an $O(n^k)$-sized semidefinite program. This implies
that given any set of polynomial constraints of the form $\{p = 0\}$, $\{p \ge 0\}$, one can efficiently find a degree
$k$ pseudo-distribution satisfying those constraints if one exists. This is often called the degree $k$ Sum-of-Squares
algorithm [69, 62, 53, 63]. Hence we can compute the norm $\|X\|_{\mathcal{K}_k}$ of any tensor $X$ to within
arbitrary accuracy in polynomial time. And because it is a relaxation to the tensor nuclear norm which is
defined analogously but over a distribution on $C$-incoherent vectors instead of a pseudo-distribution over
them, we have that $\|X\|_{\mathcal{K}_k} \le \|X\|_{\mathcal{A}}$ for every tensor $X$. Throughout most of this paper, we will be interested
in the case $k = 6$.

3.2 Resolution in $\mathcal{K}_6$

Recall that any polynomial time computable norm with good Rademacher complexity with $m$ observations
yields an algorithm for strong refutation with roughly $m$ clauses too. Here we will use an algorithm for
strongly refuting random 3-SAT to guide our search for an appropriate norm. We will adapt an algorithm
due to Coja-Oghlan, Goerdt and Lanka [25] that strongly refutes random 3-SAT, and will instead give an
algorithm that strongly refutes random 3-XOR. Moreover each of the steps in the algorithm embeds into
the sixth level of the sum-of-squares hierarchy by mapping resolution operations to applications of
Cauchy-Schwarz, which ultimately show how the inequalities that define the norm (Definition 3.2) can be manipulated
to give bounds on its own Rademacher complexity.

Let's return to the task of bounding the Rademacher complexity of $\|\cdot\|_{\mathcal{K}_6}$. Let $X$ be arbitrary but satisfy
$\|X\|_{\mathcal{K}_6} \le 1$. Then there is a degree six pseudo-expectation meeting the conditions of Definition 3.2. Using
Cauchy-Schwarz we have:

$$(\langle Z, X\rangle)^2 = \Big(\sum_i \sum_{j,k} Z_{i,j,k}\, \tilde{E}[Y_i^{(1)} Y_j^{(2)} Y_k^{(3)}]\Big)^2 \le n_1 \Big(\sum_i \Big(\sum_{j,k} Z_{i,j,k}\, \tilde{E}[Y_i^{(1)} Y_j^{(2)} Y_k^{(3)}]\Big)^2\Big) \qquad (4)$$

To simplify our notation, we will define the following polynomial

$$Q_{i,Z}(Y^{(2)}, Y^{(3)}) = \sum_{j,k} Z_{i,j,k}\, Y_j^{(2)} Y_k^{(3)}$$

which we will use repeatedly. If $d$ is even then any degree $d$ pseudo-expectation operator satisfies the
constraint $(\tilde{E}[p])^2 \le \tilde{E}[p^2]$ for every polynomial $p$ of degree at most $d/2$ (e.g., see Lemma A.4 in [6]). Hence
the right hand side of (4) can be bounded as:

$$n_1 \Big(\sum_i \big(\tilde{E}[Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)})]\big)^2\Big) \le n_1 \sum_i \tilde{E}\big[(Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)}))^2\big] \qquad (5)$$

It turns out that bounding the right-hand side of (5) boils down to bounding the spectral norm of the 
following matrix. 


Definition 3.3. Let $A$ be the $n_2 n_3 \times n_2 n_3$ matrix whose rows and columns are indexed over ordered pairs
$(j, k')$ and $(j', k)$ respectively, defined as

$$A_{j,k',j',k} = \sum_i Z_{i,j,k} Z_{i,j',k'}$$

We can now make the connection to resolution more explicit: We can think of a pair of observations
$Z_{i,j,k}, Z_{i,j',k'}$ as a pair of 3-XOR constraints, as usual. Resolving them (i.e. multiplying them) we obtain a
4-XOR constraint

$$x_j \cdot x_k \cdot x_{j'} \cdot x_{k'} = Z_{i,j,k} Z_{i,j',k'}$$

$A$ captures the effect of resolving certain pairs of 3-XOR constraints into 4-XOR constraints. The challenge
is that the entries in $A$ are not independent, so bounding its maximum singular value will require some care.
It is important that the rows of $A$ are indexed by $(j, k')$ and the columns are indexed by $(j', k)$, so that $j$
and $j'$ come from different 3-XOR clauses, as do $k$ and $k'$, and otherwise the spectral bounds that we will
want to prove about $A$ would simply not be true! This is perhaps the key insight in [25].

It will be more convenient to decompose $A$ and reason about its two types of contributions separately.
To that end, we let $R$ be the $n_2 n_3 \times n_2 n_3$ matrix whose non-zero entries are of the form

$$R_{j,k,j,k} = \sum_i Z_{i,j,k} Z_{i,j,k}$$

and all of its other entries are set to zero. Then let $B$ be the $n_2 n_3 \times n_2 n_3$ matrix whose entries are of the
form

$$B_{j,k',j',k} = \begin{cases} 0, & \text{if } j = j' \text{ and } k = k' \\ \sum_i Z_{i,j,k} Z_{i,j',k'}, & \text{else} \end{cases}$$
By construction we have $A = B + R$. Finally:

Lemma 3.4.

$$\sum_i \tilde{E}\big[(Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)}))^2\big] \le C^2 n_2 n_3 \|B\| + C^6 m$$


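Proof. The pseudo-expectation operator satisfies $\{(Y_i^{(1)})^2 \le C^2\}$ for all $i$, and hence we have

$$\sum_i \tilde{E}\big[(Y_i^{(1)} Q_{i,Z}(Y^{(2)}, Y^{(3)}))^2\big] \le C^2 \sum_i \tilde{E}\big[(Q_{i,Z}(Y^{(2)}, Y^{(3)}))^2\big] = C^2 \sum_i \sum_{j,k,j',k'} \tilde{E}\big[Z_{i,j,k} Z_{i,j',k'}\, Y_j^{(2)} Y_k^{(3)} Y_{j'}^{(2)} Y_{k'}^{(3)}\big]$$

Now let $Y^{(2)} \in \mathbb{R}^{n_2}$ be a vector of variables where the $i$th entry is $Y_i^{(2)}$ and similarly for $Y^{(3)}$. Then we can
re-write the right hand side as a matrix inner-product:

$$C^2 \sum_i \sum_{j,k,j',k'} Z_{i,j,k} Z_{i,j',k'}\, \tilde{E}\big[Y_j^{(2)} Y_k^{(3)} Y_{j'}^{(2)} Y_{k'}^{(3)}\big] = C^2 \big\langle A, \tilde{E}[(Y^{(2)} \otimes Y^{(3)})(Y^{(2)} \otimes Y^{(3)})^T] \big\rangle$$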

We will now bound the contribution of $B$ and $R$ separately.

Claim 3.5. $\tilde{E}[(Y^{(2)} \otimes Y^{(3)})(Y^{(2)} \otimes Y^{(3)})^T]$ is positive semidefinite and has trace at most $n_2 n_3$.

Proof. It is easy to see that a quadratic form on $\tilde{E}[(Y^{(2)} \otimes Y^{(3)})(Y^{(2)} \otimes Y^{(3)})^T]$ corresponds to $\tilde{E}[p^2]$ for
some $p \in P_2^{n_2+n_3}$ and this implies the first part of the claim. Finally

$$\mathrm{Tr}\big(\tilde{E}[(Y^{(2)} \otimes Y^{(3)})(Y^{(2)} \otimes Y^{(3)})^T]\big) = \sum_{j,k} \tilde{E}\big[(Y_j^{(2)})^2 (Y_k^{(3)})^2\big] \le n_2 n_3$$

where the last step follows because the pseudo-expectation operator satisfies the constraints $\{\sum_{i=1}^{n_2} (Y_i^{(2)})^2 = n_2\}$
and $\{\sum_{i=1}^{n_3} (Y_i^{(3)})^2 = n_3\}$. □

Hence we can bound the contribution of the first term as $C^2 \langle B, \tilde{E}[(Y^{(2)} \otimes Y^{(3)})(Y^{(2)} \otimes Y^{(3)})^T] \rangle \le C^2 n_2 n_3 \|B\|$.
Now we proceed to bound the contribution of the second term:

Claim 3.6. $\tilde{E}[(Y_j^{(2)})^2 (Y_k^{(3)})^2] \le C^4$

Proof. It is easy to verify by direct computation that the following equality holds:

$$C^4 - (Y_j^{(2)})^2 (Y_k^{(3)})^2 = \big(C^2 - (Y_j^{(2)})^2\big)\big(C^2 - (Y_k^{(3)})^2\big) + \big(C^2 - (Y_k^{(3)})^2\big)(Y_j^{(2)})^2 + \big(C^2 - (Y_j^{(2)})^2\big)(Y_k^{(3)})^2$$

Moreover the pseudo-expectation of each of the three terms above is nonnegative, by construction. This
implies the claim. □
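The algebraic identity behind Claim 3.6 is easy to confirm mechanically. The following sketch (the function name `claim_identity_gap` and the random sampling scheme are illustrative assumptions) evaluates both sides with exact rational arithmetic, writing `y` for $Y_j^{(2)}$ and `w` for $Y_k^{(3)}$.

```python
from fractions import Fraction
import random

def claim_identity_gap(C, y, w):
    """Left side minus right side of the identity in the proof of Claim 3.6,
    with y standing for Y_j^(2) and w for Y_k^(3)."""
    left = C**4 - y**2 * w**2
    right = ((C**2 - y**2) * (C**2 - w**2)
             + (C**2 - w**2) * y**2
             + (C**2 - y**2) * w**2)
    return left - right

# Exact rational test points, so no floating-point error enters the check.
rng = random.Random(1)
samples = [tuple(Fraction(rng.randint(-9, 9), rng.randint(1, 9)) for _ in range(3))
           for _ in range(100)]
gaps = [claim_identity_gap(C, y, w) for (C, y, w) in samples]
```

A polynomial identity of this degree that holds on many random rational points holds identically; the check is a sanity test, not a replacement for the direct expansion.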

Moreover each entry in $Z$ is in the set $\{-1, 0, +1\}$ and there are precisely $m$ non-zeros. Thus the sum of
the absolute values of all entries in $R$ is at most $m$. Now we have:

$$C^2 \big\langle R, \tilde{E}[(Y^{(2)} \otimes Y^{(3)})(Y^{(2)} \otimes Y^{(3)})^T] \big\rangle \le C^2 \sum_{j,k} R_{j,k,j,k}\, \tilde{E}\big[(Y_j^{(2)})^2 (Y_k^{(3)})^2\big] \le C^6 m$$

And this completes the proof of the lemma. □
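The decomposition $A = B + R$ and the rewriting of $\sum_i Q_{i,Z}^2$ as a quadratic form in $A$ can be checked on a small random instance. This is a sketch under stated assumptions: matrices are stored as dictionaries keyed by (row pair, column pair), the helper names are hypothetical, and the formal variables are replaced by concrete integer vectors `y2` and `y3`, so it checks the algebra rather than anything about pseudo-expectations.

```python
import random

def resolution_matrices(Z):
    """Build A with A[(j, k'), (j', k)] = sum_i Z[i,j,k] * Z[i,j',k'],
    then split it into its diagonal part R and off-diagonal part B."""
    A = {}
    for (i, j, k), z in Z.items():
        for (i2, j2, k2), z2 in Z.items():
            if i == i2:
                key = ((j, k2), (j2, k))      # row (j, k'), column (j', k)
                A[key] = A.get(key, 0) + z * z2
    R = {key: v for key, v in A.items() if key[0] == key[1]}   # j = j' and k = k'
    B = {key: v for key, v in A.items() if key[0] != key[1]}
    return A, B, R

def sum_of_squared_q(Z, y2, y3, n1):
    """sum_i Q_i^2 where Q_i = sum_{j,k} Z[i,j,k] * y2[j] * y3[k]."""
    q = [0] * n1
    for (i, j, k), z in Z.items():
        q[i] += z * y2[j] * y3[k]
    return sum(v * v for v in q)

def quadratic_form(M, y2, y3):
    """(y2 (x) y3)^T M (y2 (x) y3) for a sparse matrix M."""
    return sum(v * y2[r[0]] * y3[r[1]] * y2[c[0]] * y3[c[1]]
               for (r, c), v in M.items())

rng = random.Random(0)
n1, n2, n3, m = 3, 4, 4, 12
all_triples = [(i, j, k) for i in range(n1) for j in range(n2) for k in range(n3)]
Z = {t: rng.choice([-1, 1]) for t in rng.sample(all_triples, m)}
A, B, R = resolution_matrices(Z)
y2 = [rng.choice([-2, -1, 1, 2]) for _ in range(n2)]
y3 = [rng.choice([-2, -1, 1, 2]) for _ in range(n3)]
```

In this indexing the condition $j = j'$ and $k = k'$ picks out exactly the diagonal of $A$, so $R$ is diagonal with entries summing to $m$, and $B$ carries only pairs of distinct observations that share the index $i$.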

4 Spectral Bounds 

Recall the definition of $B$ given in the previous section. In fact, for our spectral bounds it will be more
convenient to relabel the variables (but keeping the definition intact):

$$B_{j,k,j',k'} = \begin{cases} 0, & \text{if } j = j' \text{ and } k = k' \\ \sum_i Z_{i,j,k'} Z_{i,j',k}, & \text{else} \end{cases}$$

Let us consider the following random process: For $r = 1, 2, \ldots, O(\log n)$ partition the set of all ordered triples
$(i, j, k)$ into two sets $S_r$ and $T_r$. We will use this ensemble of partitions to define an ensemble of matrices
$\{B^r\}_{r=1}^{O(\log n)}$: Set $U^r_{i,j,k'}$ equal to $Z_{i,j,k'}$ if $(i, j, k') \in S_r$ and zero otherwise. Similarly set $V^r_{i,j',k}$ equal to
$Z_{i,j',k}$ if $(i, j', k) \in T_r$ and zero otherwise. Also let $E_{i,j,j',k,k',r}$ be the event that there is no $r' < r$ where
$(i, j, k') \in S_{r'}$ and $(i, j', k) \in T_{r'}$ or vice-versa. Now let

$$B^r_{j,k,j',k'} = \sum_i U^r_{i,j,k'} V^r_{i,j',k} \mathbb{1}_E$$

where $\mathbb{1}_E$ is short-hand for the indicator function of the event $E_{i,j,j',k,k',r}$. The idea behind this construction
is that each pair of triples $(i, j, k')$ and $(i, j', k)$ that contributes to $B$ will contribute to some $B^r$ with high
probability. Moreover it will not contribute to any later matrix in the ensemble. Hence with high probability

$$B = \sum_{r=1}^{O(\log n)} B^r$$

Throughout the rest of this section, we will suppress the superscript $r$ and work with a particular matrix
in the ensemble, $B$. Now let $\ell$ be even and consider

$$\mathrm{Tr}(\underbrace{BB^T BB^T \cdots BB^T}_{\ell \text{ times}})$$

As is standard, we are interested in bounding $E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)]$ in order to bound $\|B\|$. But note that
$B$ is not symmetric. Also note that the random variables $U$ and $V$ are not independent, however whether or
not they are non-zero is non-positively correlated and their signs are mutually independent. Expanding the
trace above we have

$$\mathrm{Tr}(BB^T BB^T \cdots BB^T) = \sum_{j_1,k_1} \sum_{j_2,k_2} \cdots \sum_{j_\ell,k_\ell} B_{j_1,k_1,j_2,k_2} B_{j_3,k_3,j_2,k_2} \cdots B_{j_1,k_1,j_\ell,k_\ell}$$

$$= \sum_{j_1,k_1} \sum_{i_1} \sum_{j_2,k_2} \sum_{i_2} \cdots \sum_{j_\ell,k_\ell} \sum_{i_\ell} U_{i_1,j_1,k_2} V_{i_1,j_2,k_1} \mathbb{1}_{E_1}\, U_{i_2,j_3,k_2} V_{i_2,j_2,k_3} \mathbb{1}_{E_2} \cdots U_{i_\ell,j_1,k_\ell} V_{i_\ell,j_\ell,k_1} \mathbb{1}_{E_\ell}$$

where $\mathbb{1}_{E_1}$ is the indicator for the event that the entry $B_{j_1,k_1,j_2,k_2}$ is not covered by an earlier matrix in the
ensemble, and similarly for $\mathbb{1}_{E_2}, \ldots, \mathbb{1}_{E_\ell}$.

Notice that there are $2\ell$ random variables in the above sum (ignoring the indicator variables). Moreover
if any $U$ or $V$ random variable appears an odd number of times, then the contribution of the term to
$E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)]$ is zero. We will give an encoding for each term that has a non-zero contribution, and
we will prove that it is injective.

Fix a particular term in the above sum where each random variable appears an even number of times.
Let $s$ be the number of distinct values for $i$. Moreover let $i_1, i_2, \ldots, i_s$ be the order that these indices first
appear. Now let $r_1^j$ denote the number of distinct values for $j$ that appear with $i_1$ in $U$ terms, i.e. $r_1^j$ is the
number of distinct $j$'s that appear as $U_{i_1,j,*}$. Let $r_1^k$ denote the number of distinct values for $k$ that appear
with $i_1$ in $U$ terms, i.e. $r_1^k$ is the number of distinct $k$'s that appear as $U_{i_1,*,k}$. Similarly let $q_1^j$ denote
the number of distinct values for $j$ that appear with $i_1$ in $V$ terms, i.e. $q_1^j$ is the number of distinct $j$'s
that appear as $V_{i_1,j,*}$. And finally let $q_1^k$ denote the number of distinct values for $k$ that appear with $i_1$ in
$V$ terms, i.e. $q_1^k$ is the number of distinct $k$'s that appear as $V_{i_1,*,k}$.

We give our encoding below. It is more convenient to think of the encoding as any way to answer the 
following questions about the term. 


(a) What is the order $i_1, i_2, \ldots, i_s$ of the first appearance of each distinct value of $i$?

(b) For each $i$ that appears, what is the order of each of the distinct values of $j$'s and $k$'s that appear along
with it in $U$? Similarly, what is the order of each of the distinct values of $j$'s and $k$'s that appear along
with it in $V$?

(c) For each step (i.e. a new variable in the term when reading from left to right), has the value of $i$ been
visited already? Also, has the value for $j$ or $k$ that appears along with $U$ been visited? Has the value
for $j$ or $k$ that appears along with $V$ been visited? Note that whether or not $j$ or $k$ has been visited
(together in $U$) depends on what the value of $i$ is, and if $i$ is a new value then the $j$ or $k$ value must
be new too, by definition. Finally, if any value has already been visited, which earlier value is it?

Let $r_j = r_1^j + r_2^j + \ldots + r_s^j$ and $r_k = r_1^k + r_2^k + \ldots + r_s^k$. Similarly let $q_j = q_1^j + q_2^j + \ldots + q_s^j$ and $q_k = q_1^k + q_2^k + \ldots + q_s^k$.
Then the number of possible answers to (a) and (b) is at most $n_1^s$ and $n_2^{r_j} n_3^{r_k} n_2^{q_j} n_3^{q_k}$ respectively. It is also easy
to see that the number of answers to (c) that arise over the sequence of $\ell$ steps is at most $8^\ell (s(r_j + r_k)(q_j + q_k))^\ell$.
We remark that much of the work on bounding the maximum eigenvalue of a random matrix is in removing
any $\ell^\ell$ type terms, and so one needs to encode re-visiting indices more compactly. However such terms will
only cost us polylogarithmic factors in our bound on $\|B\|$.

It is easy to see that this encoding is injective, since given the answers to the above questions one can
simulate each step and recover the sequence of random variables. Next we establish some easy facts that
allow us to bound $E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)]$.

Claim 4.1. For any term that has a non-zero contribution to $E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)]$, we must have $s \le \ell/2$
and $r_j + q_j + r_k + q_k \le \ell$.

Proof. Recall that there are $2\ell$ random variables in the product and precisely $\ell$ of them correspond to $U$
variables and $\ell$ of them to $V$ variables. Suppose that $s > \ell/2$. Then there must be at least one $U$ variable
and at least one $V$ variable that occur exactly once, which implies that its expectation is zero because the
signs of the non-zero entries are mutually independent. Similarly suppose $r_j + q_j + r_k + q_k > \ell$. Then there
must be at least one $U$ or $V$ variable that occurs exactly once, which also implies that its expectation is
zero. □

Claim 4.2. For any valid encoding, $s \le r_j + q_j$ and $s \le r_k + q_k$.

Proof. This holds because in each step where the $i$ variable is new and has not been visited before, by
definition the $j$ variable is new too (for the current $i$) and similarly for the $k$ variable. □

Finally, if $s, r_j, q_j, r_k$ and $q_k$ are defined as above then for any contributing term

$$U_{i_1,j_1,k_2} V_{i_1,j_2,k_1} U_{i_2,j_3,k_2} V_{i_2,j_2,k_3} \cdots U_{i_\ell,j_1,k_\ell} V_{i_\ell,j_\ell,k_1}$$

its expectation is at most $p^{r_j + r_k} p^{q_j + q_k}$ where $p = m/(n_1 n_2 n_3)$ because there are exactly $r_j + r_k$ distinct $U$
variables and $q_j + q_k$ distinct $V$ variables whose values are in the set $\{-1, 0, +1\}$ and whether or not a
variable is non-zero is non-positively correlated and the signs are mutually independent.

This now implies the main lemma: 

Lemma 4.3. $E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)] \le n_1^{\ell/2} (\max(n_2, n_3))^\ell p^\ell\, \ell^{3\ell+3}$

Proof. Note that the indicator variables only have the effect of zeroing out some terms that could otherwise
contribute to $E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)]$. Returning to the task at hand, we have

$$E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)] \le \sum_{s, r_j, r_k, q_j, q_k} n_1^s n_2^{r_j} n_3^{r_k} n_2^{q_j} n_3^{q_k}\, p^{r_j + r_k} p^{q_j + q_k}\, 8^\ell (s(r_j + r_k)(q_j + q_k))^\ell$$

where the sum is over all valid tuples $s, r_j, r_k, q_j, q_k$ and hence $s, r, q \le \ell/2$ and $s \le r_j + r_k$ and $s \le q_j + q_k$
using Claim 4.1 and Claim 4.2. We can upper bound the above as

$$E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)] \le \sum_{s, r_j, r_k, q_j, q_k} n_1^s (p n_2)^{r_j + q_j} (p n_3)^{r_k + q_k}\, \ell^{3\ell+3} \le \sum_{s, r_j, r_k, q_j, q_k} n_1^s (p \max(n_2, n_3))^{r_j + q_j + r_k + q_k}\, \ell^{3\ell+3}$$

Now if $p \max(n_2, n_3) \le 1$ then using Claim 4.2 followed by the first half of Claim 4.1 we have:

$$E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)] \le n_1^s (p \max(n_2, n_3))^{2s}\, \ell^{3\ell+3} \le n_1^{\ell/2} (p \max(n_2, n_3))^\ell\, \ell^{3\ell+3}$$

where the last inequality follows because $p\, n_1^{1/2} \max(n_2, n_3) > 1$. Alternatively if $p \max(n_2, n_3) > 1$ then we
can directly invoke the second half of Claim 4.1 and get:

$$E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)] \le n_1^s (p \max(n_2, n_3))^\ell\, \ell^{3\ell+3} \le n_1^{\ell/2} (p \max(n_2, n_3))^\ell\, \ell^{3\ell+3}$$

Hence $E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)] \le n_1^{\ell/2} \max(n_2, n_3)^\ell p^\ell\, \ell^{3\ell+3}$ and this completes the proof. □

As before, let $n = \max(n_1, n_2, n_3)$. Then the last piece we need to bound the Rademacher complexity is
the following spectral bound:

Theorem 4.4. With high probability, $\|B\| \le O\Big(\frac{m \log^4 n}{n_1^{1/2} \min(n_2, n_3)}\Big)$

Proof. We proceed by using Markov's inequality:

$$\Pr\big[\|B\| \ge n_1^{1/2} \max(n_2, n_3)\, p\, (2\ell)^3\big] = \Pr\Big[\|B\|^\ell \ge \big(n_1^{1/2} \max(n_2, n_3)\, p\, (2\ell)^3\big)^\ell\Big] \le \frac{E[\mathrm{Tr}(BB^T BB^T \cdots BB^T)]}{n_1^{\ell/2} \max(n_2, n_3)^\ell p^\ell (2\ell)^{3\ell}} \le \frac{\ell^3}{2^{3\ell}}$$

and hence setting $\ell = O(\log n)$ we conclude that $\|B\| \le 8 n_1^{1/2} \max(n_2, n_3)\, p \log^3 n$ holds with high probability.
Moreover $B = \sum_r B^r$ also holds with high probability. If this equality holds and each $B^r$ satisfies
$\|B^r\| \le 8 n_1^{1/2} \max(n_2, n_3)\, p \log^3 n$, we have

$$\|B\| \le \max_r O(\|B^r\| \log n) = O\Big(\frac{m \log^4 n}{n_1^{1/2} \min(n_2, n_3)}\Big)$$

where we have used the fact that $p = m/(n_1 n_2 n_3)$. This completes the proof of the theorem. □
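The proof rests on the standard fact that $\|B\|^\ell \le \mathrm{Tr}(BB^T BB^T \cdots BB^T)$ for even $\ell$, even when $B$ is not symmetric. The sketch below illustrates this on two tiny deterministic matrices using plain Python lists; the helper names and the example matrices are assumptions made for illustration and are unrelated to the random matrices in the paper.

```python
def matmul(X, Y):
    """Product of two dense matrices given as lists of rows."""
    return [[sum(X[r][t] * Y[t][c] for t in range(len(Y))) for c in range(len(Y[0]))]
            for r in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def trace_of_power(B, ell):
    """Tr((B B^T)^(ell/2)) for even ell; this quantity is at least ||B||^ell."""
    assert ell % 2 == 0 and ell > 0
    G = matmul(B, transpose(B))          # B B^T is symmetric and PSD
    P = G
    for _ in range(ell // 2 - 1):
        P = matmul(P, G)
    return sum(P[r][r] for r in range(len(P)))

def trace_bound(B, ell):
    """Upper bound on the spectral norm ||B|| obtained from the trace."""
    return trace_of_power(B, ell) ** (1.0 / ell)

B_nilpotent = [[0, 2], [0, 0]]   # not symmetric; its spectral norm is 2
B_diag = [[3, 0], [0, 4]]        # its spectral norm is 4
```

For `B_diag` the bound is 5 at $\ell = 2$ and approaches 4 as $\ell$ grows, which is why the proof takes $\ell$ on the order of $\log n$: the slack from the trace is then only a constant factor.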


Proofs of Theorem 1.1 and Corollary 1.2 

We can now bound the Rademacher complexity of the norm that we get from the sixth level sum-of-squares
relaxation to the tensor nuclear norm:

Theorem 4.5. $R^m(\|\cdot\|_{\mathcal{K}_6}) \le O\Big(\sqrt{\frac{(n_1)^{1/2} (n_2 + n_3) \log^4 n}{m}}\Big)$

Proof. Consider any $X$ with $\|X\|_{\mathcal{K}_6} \le 1$. Then using Lemma 3.4 and Theorem 4.4 we have

$$(\langle Z, X\rangle)^2 \le n_1 \Big(\sum_i \Big(\sum_{j,k} Z_{i,j,k} X_{i,j,k}\Big)^2\Big) \le C^2 n_1 n_2 n_3 \|B\| + C^6 m n_1 = O\big(m\, n_1^{1/2} \max(n_2, n_3) \log^4 n + m n_1\big)$$

Recall that $Z$ was defined in Definition 2.5. The Rademacher complexity can now be bounded as

$$R^m(\|\cdot\|_{\mathcal{K}_6}) \le O\Big(\sqrt{\frac{(n_1)^{1/2} (n_2 + n_3) \log^4 n}{m}}\Big)$$

which completes the proof of the theorem. □

Recall that bounds on the Rademacher complexity readily imply bounds on the generalization error (see 
Theorem 2.4). We can now prove Theorem 1.1: 

Proof. We solve (2) using the norm $\|\cdot\|_{\mathcal{K}_6}$. Since this norm comes from the sixth level of the sum-of-squares
hierarchy, it follows that (2) is an $n^6$-sized semidefinite program and there is an efficient algorithm to solve
it to arbitrary accuracy. Moreover we can always plug in $X = T - \Delta$ and the bounds on the maximum
magnitude of an entry in $\Delta$ together with the Chernoff bound imply that with high probability $X = T - \Delta$
is a feasible solution. Moreover $\|T - \Delta\|_{\mathcal{K}_6} \le r^*$. Hence with high probability, the minimizer $X$ satisfies
$\|X\|_{\mathcal{K}_6} \le r^*$. Now if we take any such $X$ returned by the convex program, because it is feasible its empirical
error is at most $2\delta$. And since $\|X\|_{\mathcal{K}_6} \le r^*$ the bounds on the Rademacher complexity (Theorem 4.5) together
with Theorem 2.4 give the desired bounds on $\mathrm{err}(X)$ and complete the proof of our main theorem. □

Finally we prove Corollary 1.2:

Proof. Our goal is to lower bound the absolute value of a typical entry in $T$. To be concrete, suppose that
$\mathrm{var}(T_{i,j,k}) \ge f(r, n)$ for a $1 - o(1)$ fraction of the entries where $f(r, n) = r^{1/2}/\log^{15} n$. Consider $T_{i,j,k}$, which
we will view as a degree three polynomial in Gaussian random variables. Then the anti-concentration bounds
of Carbery and Wright [21] now imply that $|T_{i,j,k}| \ge f(r, n)/\log n$ with probability $1 - o(1)$. With this in
mind, we define

$$\mathcal{R} = \{(i, j, k) \ \text{s.t.}\ |T_{i,j,k}| \ge f(r, n)/\log n\}$$

and it follows from Markov's bound that $|\mathcal{R}| \ge (1 - o(1)) n_1 n_2 n_3$. Now consider just those entries in $\mathcal{R}$
which we get substantially wrong:

$$\mathcal{R}' = \{(i, j, k) \ \text{s.t.}\ (i, j, k) \in \mathcal{R} \text{ and } |X_{i,j,k} - T_{i,j,k}| > 1/\log n\}$$

We can now invoke Theorem 1.1 which guarantees that the hypothesis $X$ that results from solving (2)
satisfies $\mathrm{err}(X) = o(1/\log n)$ with probability $1 - o(1)$ provided that $m = \tilde{\Omega}(n^{3/2} r)$. This bound on the error
immediately implies that $|\mathcal{R}'| = o(n_1 n_2 n_3)$ and so $|\mathcal{R} \setminus \mathcal{R}'| = (1 - o(1)) n_1 n_2 n_3$. This completes the proof of
the corollary. □

5 Sum-of-Squares Lower Bounds 

Here we will show strong lower bounds on the Rademacher complexity of the sequence of relaxations to the
tensor nuclear norm that we get from the sum-of-squares hierarchy. Our lower bounds follow as a corollary
from known lower bounds for refuting random instances of 3-XOR [38, 68]. First we need to introduce the
formulation of the sum-of-squares hierarchy used in [68]: We will call a Boolean function $f$ a $k$-junta if there
is a set $S \subseteq [n]$ of at most $k$ variables so that $f$ is determined by the values in $S$.

Definition 5.1. The $k$-round Lasserre hierarchy is the following relaxation:

(a) $\|v_0\|^2 = 1$, $\|v_C\|^2 = 1$ for all $C \in \mathcal{C}$

(b) $\langle v_f, v_g \rangle = \langle v_{f'}, v_{g'} \rangle$ for all $f, g, f', g'$ that are $k$-juntas and $f \cdot g = f' \cdot g'$

(c) $v_f + v_g = v_{f+g}$ for all $f, g$ that are $k$-juntas and satisfy $f \cdot g = 0$

Here we define a vector $v_f$ for each $k$-junta, and $\mathcal{C}$ is a class of constraints that must be satisfied by any
Boolean solution (and are necessarily $k$-juntas themselves). See [68] for more background, but it is easy to
construct a feasible solution to the above convex program given a distribution on feasible solutions for some
constraint satisfaction problem. In the above relaxation, we think of functions $f$ as being $\{0, 1\}$-valued. It
will be more convenient to work with an intermediate relaxation where functions are $\{-1, 1\}$-valued and the
intuition is that $u_S$ for some set $S \subseteq [n]$ should correspond to the vector for the character $\chi_S$.

Definition 5.2. Alternatively, the $k$-round Lasserre hierarchy is the following relaxation:

(a) $\|u_\emptyset\|^2 = 1$, $\langle u_\emptyset, u_S \rangle = (-1)^{Z_S}$ for all $(\oplus_S, Z_S) \in \mathcal{C}$

(b) $\langle u_S, u_T \rangle = \langle u_{S'}, u_{T'} \rangle$ for sets $S, T, S', T'$ that are size at most $k$ and satisfy $S \Delta T = S' \Delta T'$, where $\Delta$ is
the symmetric difference.

Here we have explicitly made the switch to XOR-constraints — namely (®g, Zs) has Z$ € {0,1} and 
correspond to the constraint that the parity on the set S is equal to Zs- Now if we have a feasible solution 
to the constraints in Definition 5.1 where all the clauses are XOR-constraints, we can construct a feasible 
solution to the constraints in Definition 5.2 as follows. If S' is a set of size at most fc, we define 


u S = v g - Vf 


where / is the parity function on S and g = 1 — / is its complement. Moreover let u 0 = Vq. 
Claim 5.3. {us} is a feasible solution to the constraints in Definition 5.2 
Proof. Consider Constraint (b) in Definition 5.2, and let $S, T, S', T'$ be sets of size at most $k$ that satisfy 
$S \oplus T = S' \oplus T'$ (symmetric difference). Then our goal is to show that 

$$\langle v_{g_S} - v_{f_S},\, v_{g_T} - v_{f_T} \rangle = \langle v_{g_{S'}} - v_{f_{S'}},\, v_{g_{T'}} - v_{f_{T'}} \rangle$$

where $f_S$ is the parity function on $S$, and similarly for the other functions. Then we have $f_S \cdot f_T = f_{S'} \cdot f_{T'}$ 
because $S \oplus T = S' \oplus T'$, and this implies that $\langle v_{f_S}, v_{f_T} \rangle = \langle v_{f_{S'}}, v_{f_{T'}} \rangle$. An identical argument holds for the 
other terms. This implies that all the Constraints (b) hold. Similarly suppose $(\oplus_S, Z_S) \in \mathcal{C}$. Since $f_S \cdot g_S = 0$ 
and $f_S + g_S = 1$ it is well-known that (1) $v_{f_S}$ and $v_{g_S}$ are orthogonal, (2) $v_{f_S} + v_{g_S} = v_0$, and (3) since $f_S \in \mathcal{C}$ 
in Definition 5.1, we have $v_{g_S} = 0$ (see [68]). Thus 

$$\langle u_\emptyset, u_S \rangle = \langle v_0, v_{g_S} \rangle - \langle v_0, v_{f_S} \rangle = -1$$

and this completes the proof. □ 


Now following Barak et al. [6] we can use the constraints in Definition 5.2 to define the operator $\widetilde{\mathbb{E}}[\cdot]$. In 
particular, given $p \in P^n_k$ where $p = \sum_S c_S \prod_{i \in S} Y_i$ and $p$ is multilinear, we set 

$$\widetilde{\mathbb{E}}[p] = \sum_S c_S \langle u_\emptyset, u_S \rangle$$

Here we will also need to define $\widetilde{\mathbb{E}}[p]$ when $p$ is not multilinear, and in that case if $Y_i$ appears an even number 
of times we replace it with 1 and if it appears an odd number of times we replace it by $Y_i$ to get a multilinear 
polynomial $q$ and then set $\widetilde{\mathbb{E}}[p] = \widetilde{\mathbb{E}}[q]$. 
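
The reduction to a multilinear polynomial is mechanical, and a minimal Python sketch may make it concrete. The representation of a polynomial as a dictionary from monomials to coefficients, and the names `multilinearize`, `pseudo_expectation`, and `point_moment`, are assumptions made for illustration. For simplicity the moments $\langle u_\emptyset, u_S \rangle$ are taken from a single $\pm 1$ point $x$, in which case the operator is just evaluation at $x$.

```python
from collections import defaultdict

def multilinearize(poly):
    """poly maps a monomial (tuple of variable indices, repeats allowed) to its
    coefficient. A variable with even multiplicity is replaced by 1 and one with
    odd multiplicity by Y_i, so each monomial becomes a set of indices."""
    out = defaultdict(float)
    for monomial, coeff in poly.items():
        odd = frozenset(i for i in set(monomial) if monomial.count(i) % 2 == 1)
        out[odd] += coeff
    return dict(out)

def pseudo_expectation(poly, moment):
    """Sum of c_S * moment(S) over the multilinearized polynomial, where
    moment(S) plays the role of <u_emptyset, u_S>."""
    return sum(c * moment(S) for S, c in multilinearize(poly).items())

# Illustrative moments: those of one +/-1 point x, so moment(S) = prod_{i in S} x_i.
x = {1: 1, 2: -1, 3: -1}

def point_moment(S):
    value = 1
    for i in S:
        value *= x[i]
    return value

# p = 2*Y1^2*Y2 + 3*Y1*Y2*Y3 - Y3^3
p = {(1, 1, 2): 2.0, (1, 2, 3): 3.0, (3, 3, 3): -1.0}
print(multilinearize(p))
print(pseudo_expectation(p, point_moment))
```

In this special case the result agrees with evaluating $p$ at $x$ directly, since $x_i^2 = 1$ for every coordinate.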

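Claim 5.4. $\widetilde{\mathbb{E}}[\cdot]$ is a feasible solution to the constraints in Definition 3.2, and for any $(\oplus_S, Z_S) \in \mathcal{C}$ we have 

$$\widetilde{\mathbb{E}}\Big[\prod_{i \in S} Y_i\Big] = (-1)^{Z_S}.$$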

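Proof. By construction $\widetilde{\mathbb{E}}[1] = 1$, and the proof that $\widetilde{\mathbb{E}}[p^2] \geq 0$ is given in [6], but we repeat it here for 
completeness. Let $p = \sum_S c_S \prod_{i \in S} Y_i$ be multilinear where we follow the above recipe and replace terms of 
the form $Y_i^2$ with 1 as needed. Then $p^2 = \sum_{S,T} c_S c_T \prod_{i \in S} Y_i \prod_{i \in T} Y_i$ and moreover 

$$\widetilde{\mathbb{E}}[p^2] = \sum_{S,T} c_S c_T \langle u_\emptyset, u_{S \Delta T} \rangle = \sum_{S,T} c_S c_T \langle u_S, u_T \rangle = \Big\| \sum_S c_S u_S \Big\|^2 \geq 0$$

as desired. Next we must verify that $\widetilde{\mathbb{E}}[\cdot]$ satisfies the constraints $\{\sum_{i=1}^n Y_i^2 = n\}$ and $\{Y_i^2 \leq C^2\}$ for all 
$i \in \{1, 2, \ldots, n\}$, in accordance with Definition 3.1. To that end, observe that 

$$\widetilde{\mathbb{E}}\Big[\Big(\sum_{i=1}^n Y_i^2 - n\Big) q\Big] = 0$$

which holds for any polynomial $q \in P^n_{k-2}$. Finally consider 

$$\widetilde{\mathbb{E}}\big[(C^2 - Y_i^2)\, q^2\big] = \widetilde{\mathbb{E}}\big[(C^2 - 1)\, q^2\big] \geq 0$$

which follows because $C^2 \geq 1$ and holds for any polynomial $q \in P^n_{\lfloor (k-2)/2 \rfloor}$. This completes the proof. □ 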


Theorem 5.5. [38, 68] Let $\phi$ be a random 3-XOR formula on $n$ variables with $m = n^{3/2 - \epsilon}$ clauses. Then 
for any $\epsilon > 0$ and any $c < 2$, the $k = \Omega(n^{c\epsilon})$ round Lasserre hierarchy given in Definition 5.1 permits a 
feasible solution, with probability $1 - o(1)$. 


Note that the constant in the $\Omega(\cdot)$ depends on $\epsilon$ and $c$. Then using the above reductions, we have the 
following as an immediate corollary: 

Corollary 5.6. For any $\epsilon > 0$ and any $c < 2$ and $k = \Omega(n^{c\epsilon})$, if $m = n^{3/2 - \epsilon}$ the Rademacher complexity 
$R^m(\|\cdot\|_{\mathcal{K}_k}) = 1 - o(1)$. 


Thus there is a sharp phase transition (as a function of the number of observations) in the Rademacher 
complexity of the norms derived from the sum-of-squares hierarchy. At level six, $R^m(\|\cdot\|_{\mathcal{K}_6}) = o(1)$ whenever 
$m = \omega(n^{3/2} \log^4 n)$. In contrast, $R^m(\|\cdot\|_{\mathcal{K}_k}) = 1 - o(1)$ when $m = n^{3/2 - \epsilon}$ even for very strong relaxations 
derived from $n^{2\epsilon}$ rounds of the sum-of-squares hierarchy. These norms require time $2^{n^{2\epsilon}}$ to compute but still 
achieve essentially no better bounds on their Rademacher complexity. 
A Reduction from Asymmetric to Symmetric Tensors 


Here we give a general reduction, and show that any algorithm for tensor prediction that works for symmetric 
tensors can be used to predict the entries of an asymmetric tensor too. Hardt gave a related reduction for 
the case of matrices [40] and it is instructive to first understand this reduction, before proceeding to the 
tensor case. Suppose we are given a matrix M that is not necessarily symmetric. Then the approach of [40] 
is to construct the following symmetric matrix: 


$$S = \begin{bmatrix} 0 & M^T \\ M & 0 \end{bmatrix}$$


We have not precisely defined the notion of incoherence that is used in the matrix completion literature, but 
it turns out to be easy to see that S is low rank and incoherent as well. 

The important point is that given $m$ samples generated uniformly at random from $M$, we can generate 
random samples from $S$ too. It will be more convenient to think of these random samples as being generated 
with replacement, but this reduction works just as well without replacement too. Let $M \in \mathbb{R}^{n_1 \times n_2}$. Now 
for each sample from $S$, with probability $p = \frac{n_1^2 + n_2^2}{(n_1 + n_2)^2}$ we reveal a uniformly random entry in either 
block of zeros. And with probability $1 - p$ we reveal a uniformly random entry from $M$. Each entry in $M$ 
appears exactly twice in $S$, and we choose to reveal this entry of $M$ with probability $1/2$ from the top-right 
block, and otherwise from the bottom-left block. Thus given $m$ samples from $M$, we can generate $m$ samples from $S$ 
(in fact we can generate even more, because some of the revealed entries will be zeros). It is easy to see that 
this approach works for the case of sampling without replacement too, in that $m$ samples without replacement 
from $M$ can be used to generate at least $m$ samples without replacement from $S$. 
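
The matrix reduction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: sampling is with replacement, $M$ is stored as a list of rows, and the names `build_S` and `sample_from_S` as well as the callback `sample_from_M` are hypothetical. The probability of landing in a zero block is taken to be the fraction of entries of $S$ that lie in the two zero blocks.

```python
import random

def build_S(M):
    """Explicit S = [[0, M^T], [M, 0]] for an n1 x n2 matrix M (for checking only)."""
    n1, n2 = len(M), len(M[0])
    N = n1 + n2
    S = [[0.0] * N for _ in range(N)]
    for i in range(n1):
        for j in range(n2):
            S[n2 + i][j] = M[i][j]   # bottom-left block holds M
            S[j][n2 + i] = M[i][j]   # top-right block holds M^T
    return S

def sample_from_S(sample_from_M, n1, n2, rng):
    """Produce one uniformly random entry of S, consuming at most one sample of M.
    sample_from_M() must return ((i, j), M[i][j]) for a uniformly random (i, j)."""
    zeros = n1 * n1 + n2 * n2              # entries in the two zero blocks
    p = zeros / (n1 + n2) ** 2
    if rng.random() < p:
        t = rng.randrange(zeros)           # uniform over both zero blocks
        if t < n2 * n2:                    # top-left n2 x n2 block
            return (t // n2, t % n2), 0.0
        t -= n2 * n2                       # bottom-right n1 x n1 block
        return (n2 + t // n1, n2 + t % n1), 0.0
    (i, j), value = sample_from_M()
    if rng.random() < 0.5:
        return (j, n2 + i), value          # copy in the top-right block
    return (n2 + i, j), value              # copy in the bottom-left block
```

Each position of $S$ is returned with probability $1/(n_1 + n_2)^2$: a zero-block position with probability $p$ divided by the number of zero entries, and each of the two copies of $M_{i,j}$ with probability $(1 - p) \cdot \frac{1}{n_1 n_2} \cdot \frac{1}{2}$.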

Now let us proceed to the tensor case. Let us introduce the following definition, for ease of notation: 

Definition A.1. Let $m(n, r, \epsilon, f, C)$ be such that there is an algorithm that, on a rank $r$, order $d$, size 
$n \times n \times \cdots \times n$ symmetric tensor where each factor has norm at most $C$, returns an estimate 
$X$ with $\mathrm{err}(X) = f$ with probability $1 - \epsilon$ when it is given $m(n, r, \epsilon, f, C)$ samples chosen uniformly at random 
(and without replacement). 
Lemma A.2. For any odd $d$, suppose we are given $m\big(\sum_{j=1}^d n_j,\ 2^{d-1} r,\ \epsilon,\ f,\ \sqrt{d}\big)$ samples chosen uniformly 
at random (and without replacement) from an $n_1 \times n_2 \times \cdots \times n_d$ tensor 

$$T = \sum_{i=1}^r a_i^1 \otimes a_i^2 \otimes \cdots \otimes a_i^d$$

where each factor is unit norm. There is an algorithm that with probability at least $1 - \epsilon$ returns an estimate 
$Y$ with 

$$\mathrm{err}(Y) \leq \frac{\big(\sum_{j=1}^d n_j\big)^d}{d!\, 2^{d-1} \prod_{j=1}^d n_j}\, f$$

Proof. Our goal is to symmetrize an asymmetric tensor, and in such a way that each entry in the symmetrized 
tensor is either zero or else corresponds to an entry in the original tensor. Our reduction will work for any 
odd order $d$ tensor. In particular let 

$$T = \sum_{i=1}^r a_i^1 \otimes a_i^2 \otimes \cdots \otimes a_i^d$$

be an order $d$ tensor where the dimension of $a_i^j$ is $n_j$. Also let $n = \sum_{j=1}^d n_j$. Then we will construct a 
symmetric, order $d$ tensor as follows. Let $\sigma_1, \sigma_2, \ldots, \sigma_d$ be a collection of $d$ random $\pm 1$ variables that are chosen 
uniformly at random from the $2^{d-1}$ configurations where $\prod_{j=1}^d \sigma_j = 1$. Then we consider the following 
random vector 

$$a_i(\sigma_1, \sigma_2, \ldots, \sigma_d) = [\sigma_1 a_i^1, \sigma_2 a_i^2, \ldots, \sigma_d a_i^d]$$

Here $a_i(\sigma_1, \sigma_2, \ldots, \sigma_d)$ is an $n$-dimensional vector that results from concatenating the vectors $a_i^1, a_i^2, \ldots, a_i^d$ but 
after flipping some of their signs according to $\sigma_1, \sigma_2, \ldots, \sigma_d$. Then we set 

$$S = \mathbb{E}_{\sigma_1, \sigma_2, \ldots, \sigma_d} \Big[ \sum_{i=1}^r \big( a_i(\sigma_1, \sigma_2, \ldots, \sigma_d) \big)^{\otimes d} \Big]$$

It is immediate that $S$ is symmetric and has rank at most $2^{d-1} r$ by expanding out the expectation into a 
sum over the valid sign configurations. Moreover each rank one term in the decomposition is of the form 
$a^{\otimes d}$ where $\|a\|_2^2 = d$ because it is the concatenation of $d$ unit vectors. 

If $a_i^1, a_i^2, \ldots, a_i^d$ are fixed, then each entry in $S$ is itself a degree $d$ polynomial in the $\sigma_j$ variables. By our 
construction of the $\sigma_j$ variables, and because $d$ is odd so there are no terms where every variable appears to 
an even power, it follows that all the terms vanish in expectation except for the terms which have a factor 
of $\prod_{j=1}^d \sigma_j$, and these are exactly terms that correspond to some permutation $\pi : [d] \to [d]$, and a term of 
the form 

$$\sum_{i=1}^r a_i^{\pi(1)} \otimes a_i^{\pi(2)} \otimes \cdots \otimes a_i^{\pi(d)}$$

Hence all of the entries in $S$ are either zero or are $2^d$ times an entry in $T$. As before, we can generate $m$ 
uniformly random samples from $S$ given $m$ uniformly random samples from $T$, by simply choosing to sample 
an entry from one of the blocks of zeros with the appropriate probability, or else revealing an entry of $T$ and 
choosing where in $S$ to reveal this entry uniformly at random. Hence: 


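$$\frac{1}{\big(\sum_{j=1}^d n_j\big)^d} \sum_{(i_1, i_2, \ldots, i_d) \in \Gamma} \big| X_{i_1, i_2, \ldots, i_d} - S_{i_1, i_2, \ldots, i_d} \big| \leq \frac{1}{\big(\sum_{j=1}^d n_j\big)^d} \sum_{i_1, i_2, \ldots, i_d} \big| X_{i_1, i_2, \ldots, i_d} - S_{i_1, i_2, \ldots, i_d} \big|$$

where $\Gamma$ represents the locations in $S$ where an entry of $T$ appears. The right hand side above is at most 
$f$ with probability $1 - \epsilon$. Moreover each entry in $T$ appears in exactly $d!$ locations in $S$. And when it does 
appear, it is scaled by $2^{d-1}$. And hence if we multiply the left hand side by 

$$\frac{\big(\sum_{j=1}^d n_j\big)^d}{d!\, 2^{d-1} \prod_{j=1}^d n_j}$$

we obtain $\mathrm{err}(Y)$. This completes the reduction. □ 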
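
The block structure of the symmetrized tensor used in the proof above can be checked on a small example. The following Python sketch is an illustration under explicit assumptions: it fixes $d = 3$, uses small integer factors that are not unit norm (so the arithmetic is exact), and defines $S$ as the unnormalized sum over the $2^{d-1}$ sign patterns with product $+1$ rather than their average. Under that convention, an entry of $S$ whose coordinates fall in $d$ distinct blocks equals $2^{d-1}$ times the corresponding entry of $T$, and every other entry is zero. The helper names are hypothetical.

```python
import itertools
import math

def concat_signed(component, signs):
    """[s_1 a^1, s_2 a^2, ..., s_d a^d] as one flat list of length n = sum_j n_j."""
    return [s * v for s, vec in zip(signs, component) for v in vec]

def symmetrize(components):
    """components[i][j] is the factor a_i^j. Returns S as {index tuple: value},
    summing (not averaging) over the sign patterns whose product is +1."""
    d = len(components[0])
    n = sum(len(vec) for vec in components[0])
    patterns = [s for s in itertools.product((1, -1), repeat=d)
                if math.prod(s) == 1]
    vectors = [concat_signed(c, s) for s in patterns for c in components]
    return {idx: sum(math.prod(a[t] for t in idx) for a in vectors)
            for idx in itertools.product(range(n), repeat=d)}

def locate(t, dims):
    """Map a coordinate t of the concatenated vector to (block j, offset in block)."""
    for j, nj in enumerate(dims):
        if t < nj:
            return j, t
        t -= nj
    raise IndexError("coordinate out of range")

def entry_of_T(components, position):
    """T[position] = sum_i prod_j a_i^j[position[j]]."""
    return sum(math.prod(vec[p] for vec, p in zip(c, position))
               for c in components)

# Hypothetical order-3, rank-2 example with dimensions 2 x 2 x 3.
components = [
    ([1, 2], [3, 1], [1, 2, 1]),
    ([1, -1], [2, 1], [1, 0, 3]),
]
dims = [2, 2, 3]
S = symmetrize(components)
```

Coordinates $(0, 2, 4)$ of $S$ land in blocks $1, 2, 3$ at offset $0$ each, so `S[(0, 2, 4)]` corresponds to the entry $T_{1,1,1}$; permuting the three coordinates gives the other $d! - 1$ locations where the same entry appears.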


Note that in the case where $n_1 = n_2 = n_3 = \cdots = n_d$, the error and the rank in this reduction increase only by 
at most an $e^d$ and $2^d$ factor respectively. 